[57] ABSTRACT

An improved method and apparatus is used for storing and 
delivering information over the Internet and using Internet 
technologies. According to one embodiment of the present 
invention, a method and apparatus for maintaining statistics 
on a server is disclosed. According to an alternative
embodiment, a method and apparatus is disclosed for pre- 
dicting data that a client device may request from a server on 
a network. In another embodiment of the present invention, 
a method and apparatus is disclosed for managing band- 
width between a client device and a network. According to 
yet another embodiment, a method and apparatus is dis- 
closed for validating a collection of data. According to yet 
another embodiment, a method for providing notification to 
clients from servers is disclosed. 

5 Claims, 10 Drawing Sheets 
[Drawing (flow chart); legible step labels: DOWNLOADING BASED ON THE METRICS DATA FROM THE SERVER; STORING THE DATA IN A CACHE ON THE CLIENT DEVICE]

METHOD AND APPARATUS FOR STORING
AND DELIVERING DOCUMENTS ON THE
INTERNET

FIELD OF THE INVENTION

The present invention relates to the field of Internet and
wide-area networking technology. Specifically, the present
invention relates to the storage and delivery of information
over the Internet and using Internet technologies.

DESCRIPTION OF RELATED ART

The World Wide Web (the Web) represents all of the
computers on the Internet that offer users access to
information on the Internet via interactive documents or Web
pages. Web information resides on Web servers on the
Internet or within company networks. Web client machines
running Web browsers or other Internet software can access
these Web pages via a communications protocol known as
HyperText Transfer Protocol (HTTP). With the proliferation
of information on the Web and information accessible in
company networks, it has become increasingly difficult for
users to locate and effectively use this information. As such,
the mode of storing, delivering, and interacting with data on
the Internet, and the Web in particular, has changed over
time.

FIG. 1A illustrates a typical Internet configuration
comprising client 100 and content provider 102 coupled via the
Internet. The content provider may include a media
company, a consumer service, a business supplier, or a
corporate information source inside the company's network.

The use of information within a wide-area network such
as the Internet poses problems not usually experienced in
smaller, local-area networks. The latency of the Internet
produces delays that become the performance bottleneck in
retrieving information. Clients may be connected to the
network only part of the time but still want access to
information from their local platform that was retrieved
from the content provider prior to being disconnected. The
granularity and independence of the objects in a wide-area
network, particularly the Internet, make the task of
aggregating them more difficult.

The use of client and intermediate caching of the content
provider information may alleviate some of the problems of
the wide-area network interactions. Certain implementations
today perform this caching on behalf of the client, but
sacrifice data timeliness and do not address performance
problems because they must validate their caches in single
operations over the network.

FIG. 1A also illustrates the typical Internet configuration
of client-to-content-provider interaction. Subsequent to
connecting to the Internet, client 100 will generally request
objects from the content provider 102. The client must locate
the information, often through manual or automatic
searches, then retrieve the data through the client.

When searching for the data initially, this "pull" model
provides great utility in locating information. Implicit in the
model, however, is that the client machine has the
responsibility for finding and downloading data as desired. The
user is faced with the problem of having to scour the Web for
various information sites that may be of interest to him or
her. Although this model provides a user with a large degree
of flexibility in terms of the type of information that he or
she would like to access each time he or she connects to the
Internet, there is clearly a downside to the model in that the
user is forced to constantly search for information on the
Internet. Given the exponential rate of growth of data on the
Internet, this type of searching is becoming increasingly
cumbersome.

While the pull model is effective for finding information,
once a user has found an information source (a location
from which subsequent information of interest to the user
will be delivered), he or she must continue to check for
new information periodically. In the "pull" model, the server
is inherently passive and the client does all the work of
initiating requests. If the server has new information of
interest to the client, the server has no method of delivering
either the information or a notification to the client that the
information exists. The content provider cannot, in the pull
model, provide an "information service" where active server
information is identified, then passed to the client in terms
of some kind of notification.

It is therefore an object of the present invention to provide
a method to manage passive and active data throughout the
network, and offer an improved method and apparatus for
storing and delivering information on the Internet.

SUMMARY OF THE INVENTION

The present invention discloses an improved method and
apparatus for storing and delivering information over the
Internet and using Internet technologies. According to one
embodiment of the present invention, a method for
managing and validating a collection of data that may be
distributed in caches throughout a network is disclosed.
According to an alternate embodiment, a method and
apparatus for maintaining statistics on a server is disclosed.
According to an alternative embodiment, a method and
apparatus is disclosed for predicting data that a client device
may request from a server on a network. In another
embodiment of the present invention, a method and apparatus is
disclosed for managing bandwidth between a client device
and a network. According to yet another embodiment, a
method and apparatus is disclosed for validating a collection
of data. According to yet another embodiment, a method for
providing notification to clients from servers is disclosed.

Other objects, features and advantages of the present
invention will be apparent from the accompanying drawings and
from the detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example
and not by way of limitation in the figures of the
accompanying drawings in which like reference numerals refer to
similar elements and in which:

FIG. 1A illustrates a typical Internet configuration.

FIG. 1B illustrates a typical computer system in which the
present invention operates.

FIG. 2 illustrates the three major components according to
one embodiment of the present invention.

FIG. 3 illustrates an overview of how the three
components of one embodiment of the present invention interact
with each other.

FIGS. 4-6 are flow charts illustrating embodiments of the
present invention.

FIG. 7A illustrates a regular expression used as a positive
filter for links on a page.

FIG. 7B illustrates a regular expression used as a "negative
filter" for links on a page.

FIG. 8 is a flow chart illustrating one embodiment of the
present invention.
DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT 

The present invention relates to a method and apparatus 
for storing and delivering documents on the Internet. In the 
following detailed description, numerous specific details are 
set forth in order to provide a thorough understanding of the 
present invention. It will be apparent to one of ordinary skill 
in the art, however, that these specific details need not be
used to practice the present invention. In other instances, 
well known structures, interfaces and processes have not 
been shown in detail in order not to unnecessarily obscure 
the present invention. 

FIG. 1B illustrates a typical computer system 100 in
which the present invention operates. One embodiment of 
the present invention is implemented on a personal computer 
architecture. It will be apparent to those of ordinary skill in 
the art that other alternative computer system architectures 
may also be employed. 

In general, such computer systems as illustrated by FIG.
1B comprise a bus 101 for communicating information, a
processor 112 coupled with the bus 101 for processing 
information, main memory 103 coupled with the bus 101 for 
storing information and instructions for the processor 112, a 
read-only memory 104 coupled with the bus 101 for storing 
static information and instructions for the processor 112, a 
display device 105 coupled with the bus 101 for displaying 
information for a computer user, an input device 106 
coupled with the bus 101 for communicating information 
and command selections to the processor 112, and a mass 
storage device 107 coupled with the bus 101 for storing 
information and instructions. A data storage medium 108, 
such as a magnetic disk and associated disk drive, containing 
digital information is configured to operate with mass stor- 
age device 107 to allow processor 112 access to the digital 
information on data storage medium 108 via bus 101. 

Processor 112 may be any of a wide variety of general 
purpose processors or microprocessors such as the PEN- 
TIUM® brand processor manufactured by INTEL® Corpo- 
ration. It will be apparent to those of ordinary skill in the art, 
however, that other varieties of processors may also be used 
in a particular computer system. Display device 105 may be 
a liquid crystal device, cathode ray tube (CRT), or other 
suitable display device. Mass storage device 107 may be a 
conventional hard disk drive, floppy disk drive, CD-ROM 
drive, or other magnetic or optical data storage device for 
reading and writing information stored on a hard disk, a 
floppy disk, a CD-ROM, a magnetic tape, or other magnetic
or optical data storage medium. Data storage medium 108 
may be a hard disk, a floppy disk, a CD-ROM, a magnetic 
tape, or other magnetic or optical data storage medium. 

In general, processor 112 retrieves processing instructions 
and data from a data storage medium 108 using mass storage 
device 107 and downloads this information into random 
access memory 103 for execution. Processor 112 then
executes an instruction stream from random access memory
103 or read-only memory 104. Command selections and
information input at input device 106 are used to direct the
flow of instructions executed by processor 112. Input
device 106 may equivalently be a pointing device such as a
conventional mouse or trackball device. The results of this
processing execution are then displayed on display device
105.

Computer system 100 includes a network device 110 for
connecting computer system 100 to a network. Examples of
network device 110 include Ethernet devices, data modems
and ISDN adapters. It will be apparent to one of ordinary
skill in the art that other network devices may also be
utilized.

The preferred embodiment of the present invention is
implemented as a software module, which may be executed
on a computer system such as computer system 100 in a
conventional manner. Using well known techniques, the
application software of the preferred embodiment is stored
on data storage medium 108 and subsequently loaded into
and executed within computer system 100. Once initiated,
the software of the preferred embodiment operates in the
manner described below.

1. Introduction 

The presently claimed invention improves use of the Web
and wide-area networks by managing groups of network
objects (content or applications) and bringing that content
and notifications from servers directly to desktops in a
timely fashion and while consuming a minimal amount of
desktop screen space. Users subscribe to "channels", which
automatically bring new content to the user's machine and
render information in summary form. Detailed content
attached to the subscription is rendered in the user's web
browser, and automatically pre-fetched by one embodiment
of the present invention.

Channel configuration and rendition, as well as related
content pre-fetch, are all controlled by default from the
content provider via a back-end server, thus giving content
providers a large amount of control over their own data and
how it is presented. Content providers can supply their own
graphics, advertising, ticker information, animation control,
and content refresh parameters. Subscribing to channels is
also simple: content providers simply place an icon on one
or more pages in their site and clicking the icon causes a
subscription to be created on the client side and/or the server
side. The subscription is then updated automatically.

According to one embodiment, the presently claimed
invention plays a unique role as an intermediary and
mediator between end users, namely consumers of information,
and content providers, namely producers of information.
One embodiment of the present invention provides
functionality that gives to the content provider control over their
brand, the way in which their information is presented, and
the way in which users access their web site. At the same
time, users are given the ability to pick and choose the
information they want from the sites they want, and
presentation of that information is accelerated, thus improving the
user's web experience.

An intelligent caching infrastructure that uses information
(called "Meta-Data") about the content to control client and
intermediate caches reduces the wide-area networking
problems generally attributed to interactive content by allowing
the caches to manage expiration, compaction, bulk-delivery
and other operations guided via Meta-Data from the content
provider. The caches become intelligent delivery nodes at
the client, and within the network, because they are able to
understand the important properties of the information they
are managing.

The presently claimed invention also provides a
notification system to the content provider and user that spans
Internet, firewall, and internal network systems, and can
combine various underlying transports and notifications into
a general notification architecture. This offers content
providers (servers) the opportunity to provide a proactive
information service to the user. The user may receive various
types of Internet and internal company notifications and
information.

When the notification and "active information" push 
model is combined with an intelligent caching network 
infrastructure, in the presently claimed invention, the user
achieves the highest degree of functionality, performance,
and usability. As described in further detail below, the
presently claimed invention spans client and server
machines, creating a system that allows feedback from the
client to the back-end server, and subsequent optimization of
the client by the back-end server.

2. Components

The presently claimed invention consists of three major
components: the content bar, caching server and back-end
server, as illustrated in FIG. 2. The content bar 302 and the
caching server 304 reside on one or more end-user
computers 100 ("client machines") owned by information
subscribers. The back-end server 350 resides on one or more
server-class computers owned by information publishers
("back-end server machines"). The content bar 302 and caching
server 304 are logical components. Each component can be
implemented as separate processes or within a single
process.

The client machines 100 and the back-end server
machines communicate over a network such as the Internet
or a corporate intranet. The communication mechanism
includes open standard protocols such as HTTP (Hypertext
Transfer Protocol), MIME (Multipurpose Internet Mail
Extensions) and TCP/IP (Transmission Control Protocol/
Internet Protocol).

One embodiment of the present invention is implemented
in the caching server 304. An alternate embodiment is
implemented in a combination of the caching server 304 and
the back-end server 350. The content bar 302 in either
embodiment is a rendering environment for published
content. Although the following sections describe a two-tier
client-server architecture, the presently claimed invention
may also be implemented according to other architectures.
For example, while the content bar 302 always resides on a
subscriber's client machine 100 and the back-end server 350
always resides on a publisher's back-end server machine,
there can be any number of intermediate caching servers
between the subscriber and the publisher.

The caching servers do not have to reside on the client
machine. This configuration typically provides the best
performance, however, by taking advantage of local disk
access speed to increase performance. Caching servers can
be deployed around the network to balance network load and
provide concentration of frequently used information. In this
embodiment, each caching server 304 is implemented as a
standard HTTP proxy server thus allowing them to be
coupled together hierarchically. FIG. 3 illustrates an
overview of how the three components of one embodiment of the
present invention interact with each other. A subscriber's
requests to retrieve certain published content are managed
by subscription manager 306 which resides on the
subscriber's client machine. Subscription manager 306
communicates with Web browser 100 on the client machine to
demand the requested content. Web browser 100 then sends
an HTTP request to a remote caching server 204. In
response, caching server 204 either retrieves cached content
from cache 300 or sends an HTTP request via the Internet to
a publisher's machine to retrieve non-cached content.

2.1 Content Bar

The end user interacts with the content bar. According to
one embodiment, the content bar is responsible for rendering
channel subscription content, configuring each
subscription's look and feel, and managing the user's interaction
with the web. The user can create several content bars, and
can float them on the desktop, dock them to one of the edges
of the display, or dock them to a web browser. When docked
to one of the screen edges, the content bar can be configured
to auto-hide, so that it appears only when the mouse is
placed on the edge of the screen.

Each content bar contains one or more channels, namely
areas of the content bar that belong to a particular content
provider. Channel look and feel is under the control of the
content provider via open data formats (MIME, HTTP,
standard image formats such as GIF and JPEG). The content
bar provides a common rendering environment for the
channels so they can be moved, resized, or locally
configured by the user in a standard way.

Each channel contains one or more "subscriptions."
Subscriptions are agents configured to retrieve information at
various times, or to process asynchronous notifications of
incoming data. Initial configuration is set up by the content
provider according to the content provider's publishing
schedules. The user can change this information as well as
add to it. Each channel subscription uses a notification
mechanism to retrieve new data; the notification mechanism
can be simple polling, or more complex asynchronous event
notification mechanisms, as described later in the document.

2.2 Caching server

The caching server manages all of the user's interaction
with the web. All web requests, including those generated by
the user's browser and those generated by channel
subscriptions, go through the caching server. The caching
server is responsible for the following areas of functionality,
either alone or in concert with one or more publisher
back-end servers. These areas of functionality are described
in further detail later in this document:

Intelligent cache management, including local algorithms
for automatic expiration management and content
compaction, and algorithms shared with the back-end server
for custom expiration management.

Statistics collection and upload to back-end servers.

Lookahead pre-fetch of content based on local algorithms
and on custom control information from back-end servers.

Bulk validation of content based on information
(meta-data) from back-end servers.

Intelligent bandwidth management, allowing user
requests priority over background lookahead pre-fetch
requests.

Registration and subscription by the user to information
sources.

Handling of incoming channel subscription notifications,
removing the need for the caching server to poll its content
providers for new information.

A caching server can operate on its own behalf or a
community of many users, who in turn have caching servers
on their own machines. These higher-level caching servers
can be placed at intranet/Internet boundaries to provide
information concentration and conserve network bandwidth.

2.3 Back end

The back end server is a collection of software that works
with client caching servers to optimize use of a publisher's
site by its subscribers. According to one embodiment, each
publisher has a back-end server that controls use of its
content by clients and feeds information extracted from the
clients back to the publisher. The back-end server is
responsible for the following areas of functionality:

Maintenance of cache control meta-data. According to
one embodiment, this data is provided to caching servers
which use the data to control the way in which the content
provider's information is cached.

Generation of subscription data and subsequent
publishing of that data for retrieval by caching servers, or
subsequent sending of that data directly to the caching servers.

Creation and maintenance of bulk-validation information.
Bulk-validation data is initially created by the content
provider and sent from the back-end server to caching servers,
allowing them to validate the publisher's cached content
efficiently.

Creation and maintenance of lookahead information. As
above, the information is initially created by the content
provider. As it is sent to caching servers and used by those
servers, updated information is uploaded from the servers
and used to fine-tune the lookahead information. The result
is a feedback loop that tunes lookahead based on client use
of the publisher's content.

Generation of content and subscription usage reports. The
back-end server uses statistics uploaded from caching
servers to give publishers an accurate picture of how their site is
used, including for example extremely accurate
advertisement display- and click-through counts.

3. Shared Technology 

This section details the technology that is shared by the
back-end server and the caching server. According to one
embodiment, the interaction between these two components
provides configuration flexibility and efficient performance.
By locating acceleration information at the publisher, the
creation of that data is placed in the hands of the people most
likely to know how to manage it, namely the publisher. By
then downloading that information to caching servers, the
system allows a site to be accelerated according to the
wishes of the site owner, who is in the best position to know
how to do this.

According to one embodiment, the system further
introduces a feedback loop so that the publisher is in a position
to retrieve accurate information about their site from their
subscribers' caching servers. That information can be used in
its own right to control advertising rates or provide reports
of various sorts. The information can also be fed back into
the information passing back to the caching servers, further
enhancing their ability to accelerate the site.

3.1 Bulk Validation 

3.1.1 Overview 

Once a cached piece of content expires, the caching server
must validate it. This task involves sending a request to the
content's owner, and the content owner either responds that
the content has not changed or provides the latest version of
the content. In the latter case, the caching server still
experiences the overhead of retrieving the content. In the
former case, however, the server can increase performance
significantly by using bulk validation. With bulk validation,
a single request to the owning server results in large amounts
of content being automatically validated or invalidated.
Content that is invalidated is marked as expired in the cache,
and the next time the caching server is asked for that content,
the caching server knows to retrieve that content from the
content owner. Content that is still valid has its expiration
date extended and continues to be served from the cache
until the content finally expires.

Bulk validation revolves around the idea that a group of
content frequently has similar expiration characteristics. In
a news page, for example, the masthead, footer, and side bars
may be "boilerplate" that rarely changes, whereas the article
and its associated images will change constantly. In a given
site, all content that has similar expiration characteristics can
be grouped into a single list, called a TOC (Table Of
Contents). The TOC is an HTML page consisting of a header
and a body that contains tags describing TOC members. The
TOC is not intended to be viewed by end users; it is simply
meta-data shared between the caching server and its
back-end servers and used to configure the caching server's
behavior. Meta-data is described in further detail in Section
3.2 Meta-Data.

Each TOC member is represented by a single ICPAGE 
HTML tag. The tag contains a number of attributes (TOCs 
are also used for lookahead configuration as described in 
Section 3.4.6, Custom Weight Assignment), including the 
LASTMOD attribute. This attribute contains the member's 
last-modification date in seconds since midnight Jan. 1, 1970 
(standard epoch) and is stored with the member in the cache: 

<ICPAGE
URL = "http://truthincommoncom/library/achannelshtml"
LASTMOD = 8354009$

The TOC is assigned an expiration date, which automati- 
cally applies to all members of the TOC. The expiration date 
can be assigned explicitly via a standard HTML META tag 
(the standard mechanism for driving HTTP data from 
content), or via any of the custom expiration mechanisms 
defined in Section 3.3, Custom Expiration Control. FIG. 4 is 
a flow chart illustrating an overview of one embodiment of 
bulk validation. 
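
The TOC member tags described above can be read with an ordinary HTML parser. The following is a minimal sketch, not the implementation disclosed here: it assumes each member appears as an ICPAGE start tag carrying URL and LASTMOD attributes, and the class name TocParser, the function parse_toc, and the sample URLs and timestamps are all hypothetical.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser


class TocParser(HTMLParser):
    """Collects URL -> LASTMOD pairs from ICPAGE tags in a TOC page."""

    def __init__(self):
        super().__init__()
        self.members = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "icpage":
            return
        fields = dict(attrs)
        url = fields.get("url")
        lastmod = fields.get("lastmod")
        if url is not None and lastmod is not None:
            # LASTMOD is in seconds since midnight Jan. 1, 1970.
            self.members[url] = int(lastmod)


def parse_toc(html_text):
    """Return a dict mapping each member URL to its LASTMOD value."""
    parser = TocParser()
    parser.feed(html_text)
    parser.close()
    return parser.members


# Hypothetical TOC page; the host name and timestamps are invented.
SAMPLE_TOC = """
<HTML><HEAD><TITLE>boilerplate</TITLE></HEAD><BODY>
<ICPAGE URL="http://publisher.example/logo.gif" LASTMOD=835400900>
<ICPAGE URL="http://publisher.example/footer.html" LASTMOD=835400950>
</BODY></HTML>
"""

members = parse_toc(SAMPLE_TOC)
for url, lastmod in members.items():
    stamp = datetime.fromtimestamp(lastmod, tz=timezone.utc)
    print(url, stamp.isoformat())
```

Keeping LASTMOD as an integer makes the later comparison between the cached copy and an incoming TOC copy a simple equality test.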

3.1.2 Client Side Behavior 

According to one embodiment, the caching server is 
designed such that whenever a TOC member is accessed, the 
TOC's expiration date always overrides the member's expi- 
ration date. Whenever the TOC expires, all its members 
automatically expire. Once a TOC expires, bulk validation 
begins. Any time one of its members is accessed, the 
member is noted as expired (it shares its TOC's expiration 
date), which causes the TOC to be retrieved first. The 
caching server receives one of two responses to its request 
for the TOC, just as it would for any other page. The first 
possibility is that the TOC has not changed. If so, then the 
TOC expiration date is updated according to its meta-data if 
present or the expiration algorithm if it is not. Each member 
automatically gets the new expiration date. 

The second possibility is that the TOC has indeed 
changed, in which case the owning server sends the new 
TOC to the caching server. The caching server then parses 
the TOC's HTML stream looking for members. Each mem- 
ber is then looked up in the cache and its last-modification 
date compared with the incoming TOC copy's last- 
modification date for that member. If the dates are the same, 
the content has by definition not changed, and is assigned the 
TOC's new expiration date. If the dates are different, the 
content has changed and is automatically marked as having 
expired. Next time the content is accessed, it will be updated 
from the owning server. 

This behavior results in a significant performance 
improvement. For example, for a TOC with 100 members 
(e.g. the boilerplate graphics for a web site), a single 
operation simultaneously validates all 100 members. With- 
out a TOC, the caching server would have to perform 100 
network operations to validate one member at a time, and 
most of the operations are likely to be useless because the 
content has probably not changed. 
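
The client-side behavior described in this section can be sketched as follows. This is an illustrative model under stated assumptions, not the disclosed implementation: the names CacheEntry, TocCache, and revalidate are hypothetical, the owning server's "not changed" answer is modeled as None, and members that are cached but absent from the new TOC are simply left alone because the text does not say how they are handled.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    lastmod: int           # last-modification date, seconds since the epoch
    expired: bool = False  # True once the member must be re-fetched


class TocCache:
    """Toy cache in which all members share their TOC's expiration date."""

    def __init__(self, toc_members, expires_at):
        # toc_members maps URL -> LASTMOD as listed in the cached TOC copy.
        self.entries = {url: CacheEntry(lm) for url, lm in toc_members.items()}
        self.expires_at = expires_at

    def toc_expired(self, now):
        # The TOC's date overrides any per-member expiration date.
        return now >= self.expires_at

    def revalidate(self, new_members, new_expires_at):
        """Apply the owning server's answer to a single TOC request.

        new_members is None when the server reports the TOC as unchanged;
        otherwise it maps URL -> LASTMOD taken from the new TOC copy.
        Returns the URLs that were marked as expired.
        """
        invalidated = []
        if new_members is not None:
            for url, lastmod in new_members.items():
                entry = self.entries.get(url)
                if entry is None:
                    continue  # not cached yet; fetched on first access
                if entry.lastmod != lastmod:
                    # Dates differ: the content changed, so expire it.
                    entry.expired = True
                    invalidated.append(url)
        # Unchanged members inherit the TOC's new expiration date.
        self.expires_at = new_expires_at
        return invalidated


# One request validates three members; only the article has changed.
cache = TocCache(
    {"/logo.gif": 100, "/footer.html": 100, "/article.html": 100},
    expires_at=1000,
)
stale = cache.revalidate(
    {"/logo.gif": 100, "/footer.html": 100, "/article.html": 250},
    new_expires_at=2000,
)
print(stale)
```

In this sketch a single call to revalidate stands in for the one network operation that replaces a separate validation request for every member.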

3.1.3 Locating TOCs 

Once an administrator has generated a TOC, the local 
caching server must be able to find the TOC so that the TOC 
can be loaded. A TOC is just another form of meta-data.
TOC pages are therefore located exactly the same way as
site meta-data is located, as described in section 3.2
Meta-Data.

3.1.4 TOC Management

According to one embodiment of the present invention,
TOCs for a given site are maintained by the back-end server
for that site. Each TOC has a well-known name that
identifies the TOC uniquely on that site. The collection of TOCs
on the site comprise a TOC catalog, that can be browsed by
administrators seeking to create new TOCs, delete TOCs, or
modify existing TOCs.

The TOC catalog also contains access control and
authentication information. The authentication data enables the
back-end server to verify the identity of administrators
wishing to manage the TOCs in the catalog. The back-end
server uses the access control information to restrict
particular administration functions to various groups of people.

Because the back-end server handles TOC management,
the actual commands to manage a TOC can be issued either
from a client machine such as a client PC or from a
management interface on the back-end server machine.
Several client PC-based mechanisms can be used to manage
TOCs. The implementor can select a client-side mechanism
based on portability, UI functionality, and client
environment. Following are examples of such mechanisms:

Administrator channel on the channel bar

Custom version of client with added administrator user
interface functionality

HTML forms accessible from any web browser

Java applet accessible from any web browser

Microsoft Windows OCX accessible from any web
browser

The client-side administration mechanisms are
responsible for presenting a graphical view of the TOC catalog and
its member TOCs to the administrator. In addition, the
client-side administration mechanism may provide some
local editing capability as a means of increasing perfor-

How often it is updated by the back-end server software

Any additional update-related inclusion/exclusion criteria

How it is stored in a catalog of TOCs for maintenance by
administrators

Management security attributes that control site administrators'
abilities to read, modify, or delete the TOC

3.1.6 Deleting a TOC

According to one embodiment, deleting a TOC consists of
removing its catalog entry and associated generation criteria.
The operation is performed by the back-end server, either in
response to a client administration front-end or a back-end
server utility. The back-end server uses its authentication
and access control information to restrict TOC deletion to
the appropriate administrators.

3.1.7 Modifying a TOC

According to one embodiment, modifying a TOC consists
of the following operations:

Adding one or more member pages

Removing one or more member pages

Modifying a page, changing its last-modification date,
version stamp, or lookahead weight.

The modifications can be performed manually by an
administrator, or automatically by the back-end server
according to scheduling and inclusion/exclusion criteria
specified by the administrator. Manual updates are most
useful in assigning custom lookahead weights to member
pages. Manual updates can be performed either on the
back-end server machine via server utility programs, or on
a client PC via any of the graphical administration
mechanisms described in section 3.1.4 TOC Management.
Automatic updates are performed by the back-end server
according to the scheduling information stored as part of the TOC
definition in the TOC catalog.

According to one embodiment, automatic updates begin
[illegible] members of the current
TOC. Any TOC members that were manually added by the
administrator are specially marked. The update process then
retrieves [illegible] inclusion/exclusion from the

TOC definition and begins generating a new TOC based on 

3.1.5 Creating a TOC those criteria. For each TOC member included, the update 

According to one embodiment, TOCs are created by the process determines whether the same member was present in 

back-end server in at least one of three ways, namely by the previous version of the TOC. If it was, the previous 

scanning file system directories recursively, by scanning the 45 version's lookahead weight is carried over to the new 

content of HTML pages for links and following those links version, and the previous version is marked as being present 

recursively, or being supplied the TOC information directly in the new version. 

via database or content management system through an Qnce the TOC generation completes, the update process 
Application Programming Interface (API). Files (in the first scans the previous version, copying any administrator- 
method) or pages (in the second method) are included in the created pages that were not already part of the new TOC. 
TOC if they satisfy various criteria, including but not limited Finally, the update process incorporates any additional loo- 
to the following: kahead weight statistics accumulated since the last TOC 
Number of levels to search. When scanning the file update, and then replaces the old version of the TOC with 
system, the number of levels of folders/directories to J5 the new version, 
descend before stopping. When scanning HTML content for 3.1 .8 Generalized TOC Usage 

links, the number of levels of links to follow before stop- A ™« . . f r , « . - 

r A TOC in its general form is a unit of bulk information 

K 6 * management. It describes a set of related Web objects by 
A regular expression which, if matched by the file or URL. The system then defines operations on the set such as 
URL, includes that file or URL in the TOC. 60 ^ bulk validat i 0I1 process described above. Other useful 
A regular expression which, if not matched by the file or bulk operations can also be defined for TOCs. A TOC can 
URL, includes that file or URL in the TOC. describe a set of objects which are to be retrieved in bulk by 
The TOC can be constructed using any number of differ- the caching server for later off-line viewing. ATOC can also 
ent inclusion and exclusion criteria. The above list is simply describe a set of objects to be looked ahead on, indepen- 
a sample of the possibilities. The TOC inclusion/exclusion 65 dently of the lookahead algorithm described later in this 
criteria are stored in the TOC catalog, along with other document. Sets of web objects which share caching prop- 
administrative information such as: erties can also be grouped into a TOC. 
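
The automatic TOC update described in section 3.1.7 (carry over lookahead weights for members that survive regeneration, then copy administrator-created pages that were not regenerated) can be sketched as follows. This is a minimal illustration only: the `TocEntry` record, its field names, and the `update_toc` function are hypothetical and do not appear in the specification, and the final step of folding in newly accumulated lookahead statistics is omitted.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TocEntry:
    # Hypothetical record for one TOC member; field names are assumptions.
    url: str
    weight: int = 1        # lookahead weight
    manual: bool = False   # True if the page was added by an administrator


def update_toc(previous, generated):
    """Merge a freshly generated TOC with the previous version."""
    old = {entry.url: entry for entry in previous}
    new = {}
    for entry in generated:
        prior = old.get(entry.url)
        # A member present in the previous version keeps its old lookahead weight.
        new[entry.url] = replace(entry, weight=prior.weight) if prior else entry
    # Copy administrator-created pages that generation did not rediscover.
    for entry in previous:
        if entry.manual and entry.url not in new:
            new[entry.url] = entry
    return list(new.values())
```

In this sketch a page that was generated automatically in the previous version and no longer satisfies the criteria simply drops out, while a manually added page is always retained.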
3.2 Meta-Data

3.2.1 Overview 

According to one embodiment of the present invention, 
"meta-data" is used to configure much of the behavior of 
caching servers. Content providers are allowed to configure
their sites independently of other sites (or individual sub- 
scriptions independently of other subscriptions, all of whom 
reference the same site) and that data is used to drive the 
behavior of the caching servers. 

The term "site meta-data" is used to cover all meta-data 
that optimizes a particular site. The meta-data is stored in 
HTML tags and can therefore appear anywhere in a site. To 
make administration tractable, the tags are typically grouped 
into a single page, except for TOCs, which are pointed at
from the site meta-data page. 

According to one embodiment of the present invention, a 
site meta-data page is referenced with an ICMETA pointer 
tag: 

<ICMETA URL=http://www.incommon.com/sitedefs/nytimes_site.html>

There are a number of ways that the caching server can 
locate meta-data pages. According to one embodiment, the 
site administrator can add to every page on their site an
ICMETA tag identifying the meta-data page. Whenever the 
caching server encounters an ICMETA tag in an HTML 
page, it fetches the URL pointed at by the tag. This scheme 
has the advantage that the meta-data page gets loaded 
immediately whenever a page from the site is loaded.

A variation on the above scheme uses ICMETA tags on 
only a few "strategic" pages, for example the site's home 
page. Many sites place links to an index or home page on all 
other pages. The caching server automatically looks ahead 
on all pages, and if all pages have a link to a
strategically-located page, the caching server quickly encounters the
ICMETA tag and fetches the meta-data page. 

According to an alternate embodiment, the channel devel- 
oper can tag their subscription notifications with a "meta 
data URL" that the caching server automatically fetches
before it fetches any other channel content. This last mecha- 
nism guarantees that the meta-data page will be loaded every 
time new channel data arrives at the caching server. 
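
The tag-based discovery described above can be sketched as a small HTML scan. This is an illustration under stated assumptions, not the caching server's actual implementation: the `IcmetaFinder` class and `find_meta_pages` function are hypothetical names, and resolving relative URLs against the page's own URL is an assumption the text does not state.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IcmetaFinder(HTMLParser):
    """Collects the URLs named by ICMETA tags in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.meta_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "icmeta":
            url = dict(attrs).get("url")
            if url:
                # Assumption: relative URLs resolve against the containing page.
                self.meta_urls.append(urljoin(self.base_url, url))


def find_meta_pages(html, base_url):
    """Return the meta-data page URLs a caching server would fetch next."""
    finder = IcmetaFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.meta_urls
```

A caching server following the first scheme would call something like `find_meta_pages` on every page it loads and fetch each returned URL.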

3.2.2 Areas of use 

Meta-data may be used in the following areas to control
caching server behavior: 

client configuration by intranet administrators 
custom expiration control 
lookahead pre-fetch configuration
statistics upload configuration
Different types of site meta-data are created differently. 
Because meta-data is implemented in standard HTML, 
simple configurations can be created directly by the content 
provider in a standard text or HTML editor. The data can be 
dispensed by the content provider's back-end server at
subscription notification time, or can be loaded by the 
caching server whenever the caching server encounters the 
appropriate ICMETA tag. More complex meta-data that is 
derived from user feedback can initially be generated automatically
and then automatically updated as user feedback is
gathered. The latter mechanism is described in further detail 
in the following sections. Further details of other meta-data 
applications are also described in the following sections. 

3.3 Client Configuration Meta-Data 

Meta-data is used by local network administrators to
configure the client software. The caching server has a 
built-in subscription to an HTML page containing configuration
meta-data. The publisher of this subscription is the
local network administrator, and the publishing host is an 
unqualified internet hostname. The caching server uses the 
host name "inCommon-Config". The client's network software
will automatically qualify this host name in its local
internet domain, allowing the caching server to find a server 
in any intranet without having to be configured at installa- 
tion time. Note that a server with this name need not be 
dedicated for configuration; it is just as easy to give an 
existing host an alternate (alias) host name of "inCommon- 
Config". 

This configuration mechanism is simple and powerful. It 
allows intranet administrators to configure their clients with- 
out any installation-time work by the user. Because the 
configuration data is received in the form of a subscription 
notification, clients will receive any configuration changes
as soon as they are made. If multicast notification is used, 
only one copy of the new configuration is sent to all clients. 
The mechanism is reasonably secure because the configu- 
ration host name is well-known within the client's local 
internet domain, and the client initiates contact with the 
configuration publisher. Additional security can be imple- 
mented with Secure HTTP and digital signatures to authen- 
ticate the publishing host. 

Notifications for the configuration subscription contain 
configuration directives in the form of HTML tags. The 
caching server interprets and then executes these directives. 
In this manner, the network administrator can control the 
caching server's caching behaviour (e.g. disabling it entirely 
on local area networks where the network speed is faster 
than the disk transfer rate), its proxy server, its default 
lookahead characteristics, even the set of subscriptions it 
currently uses. Any aspect of the caching server that the
system designer deems useful to control can be controlled in 
this manner. 

3.4 Custom Expiration Control 

3.4.1 Overview 

According to one embodiment, whenever the caching 
server is asked to retrieve content from the web, the caching 
server places the content in local storage while returning the 
content to the requestor (either the browser, a subscription, 
or the server itself). The server then satisfies subsequent 
requests for the same content from the local storage rather 
than going to the network. This strategy improves perfor- 
mance but incurs a cost, namely, if the content changes at its 
origin, the caching server delivers an old copy of the data 
from local storage, rather than the new copy from the 
Internet. 

The caching server solves this problem by assigning each 
piece of content an expiration date. The server satisfies 
requests for cached content from local storage until the 
expiration date is reached, after which time it checks at the 
origin site to see if the content has changed. Content 
providers may control the expiration behavior of cached 
content in a number of different ways. Flexibility in this area 
is crucial because if a piece of content's expiration date is set 
incorrectly, then the user will either see old data served out 
of the cache, or will see lower performance as the caching 
server validates or retrieves content from the network unnec- 
essarily. 
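
The expiration behavior described above can be illustrated with a small cache sketch. It is a simplification under stated assumptions: the `ExpiringCache` class, the `fetch` callable (assumed to return the content and a lifetime in seconds), and the injectable `clock` are all hypothetical, and an expired entry is simply re-fetched here, whereas the text describes checking at the origin site whether the content has changed.

```python
import time


class ExpiringCache:
    """Serves content from local storage until its expiration date passes."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch    # hypothetical callable: url -> (content, lifetime_seconds)
        self._clock = clock    # injectable so the behavior can be tested
        self._store = {}       # url -> (content, expires_at)

    def get(self, url):
        entry = self._store.get(url)
        if entry is not None and self._clock() < entry[1]:
            return entry[0]    # not yet expired: satisfy the request locally
        # Missing or expired: go back to the origin and store a new copy.
        content, lifetime = self._fetch(url)
        self._store[url] = (content, self._clock() + lifetime)
        return content
```

The trade-off named in the text is visible in the lifetime value: too long and `get` keeps returning an old copy, too short and every request goes back to the network.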

According to one embodiment of the present invention, 
three mechanisms are used for expiration control, in addition 
to the standard expiration control mechanisms offered by 
HTTP: 

TOC-based expiration control 
ICEXPIRE meta-data expiration control 
automatic expiration control 
The first two types of expiration control are described in this section, because they work by sharing information between a back-end server and a caching server. Automatic expiration control is performed entirely by the caching server and is described in section 4.1 Automatic Expiration Control.

3.4.2 TOC-based Expiration Control

TOCs can be used for bulk validation and for lookahead weight configuration, as described earlier in this document. TOC-based expiration is simply a way for content providers to apply a particular expiration date to all members of a TOC without having to modify the individual members and without having to use meta-data tags.

The content provider sets the expiration date of the TOC, using any of the methods specified below, or by simply specifying an expiration in the content itself using standard HTML META tags or HTTP headers. All members of the TOC immediately get that TOC's expiration date.

TOC-based expiration gives content providers fine control over content expiration (in addition to its primary benefits of bulk validation and custom lookahead weighting). TOC-based expiration allows content providers to group URLs with many different syntaxes but similar expiration behavior into a single construct (the TOC) and expire them together, where otherwise a large number of ICEXPIRE meta-data tags would need to be used. Again, the goal is flexibility for the content provider.

3.4.3 ICEXPIRE Overview

Expiration meta-data allows content providers to use HTML to describe the expiration control behavior they desire for the caching server when serving their content. Content providers may bind regular expressions to various expiration control parameters. Any URL that matches the regular expression is automatically given an expiration date according to the parameters associated with the regular expression.

Expiration meta-data is defined and maintained by the back-end server and can be sent down to a caching server as part of a channel subscription's notification data, or loaded by the caching server as it encounters ICMETA tags in published content. The ICEXPIRE HTML tag is used to implement this functionality. The tag has four attributes, two of which control lookup behavior and the remaining two of which define the expiration date.

Regular expression processing is traditionally slow. Given that caching server performance is extremely important, the ICEXPIRE tag provides a high-speed level of lookup before regular expression matching is performed. The HOST attribute defines a host name to which the expiration applies. Only those URLs with a matching host name are considered for regular expression matching. The host names can be used as keys in a hash table, providing a first level of high-speed lookup. Once the correct host is found, the server can traverse the set of ICEXPIRE regular expressions that apply to that host, until a match is found. Each regular expression is specified with the REGEXP attribute. Once a match is found, the expiration control attributes in the tag are applied to the matching URL, as described in the following sections. The remaining two attributes describe a fixed expiration and a minimum expiration. The uses of these attributes are described in the following sections.

3.4.4 Fixed Expirations

The most typical use of ICEXPIRE causes all URLs matching the regular expression to be assigned a specific expiration date:

<ICEXPIRE HOST = www.cnn.com
REGEXP = "http://www.cnn.com/sports/.*"
EXPIRATION = "Thu, 30 Jan 1997 11:12:13 GMT">

In the above example, all URLs on www.cnn.com that are sports stories expire on January 30 at the specified time. The expiration date can also be specified in seconds relative to the retrieval time. In the following example, all matching content gets an expiration date 600 seconds from the time it was retrieved:

<ICEXPIRE HOST = www.cnn.com
REGEXP = "http://www.cnn.com/sports/.*"
EXPIRATION = "+600">

3.4.5 Conservative expirations

Some sites may not want any of their content cached at all. These are sites whose content changes rapidly or is short-lived. An ICEXPIRE whose EXPIRATION attribute is set to the special value ALWAYS tells the server that all content satisfying the lookahead tag's match criteria is always to be validated from the network, and served from the cache only if the content has not changed. The result is poorer performance but greater accuracy.

<ICEXPIRE HOST = www.cnn.com
REGEXP = "http://www.cnn.com/sports/.*"
EXPIRATION = ALWAYS>

Content providers are thus able to control the behavior of caching servers without changing the content itself. The expiration meta-data can be applied at an arbitrarily fine granularity by tuning the ICEXPIRE regular expression appropriately.

3.4.6 Liberal expirations

At the other end of the spectrum are sites that never want their content to expire. These types of sites contain content that appears once (possibly for a very short time) and never changes. Examples are newspaper or journal articles with URL identifiers that are never re-used. For this type of content, a special value NEVER is provided for an ICEXPIRE EXPIRATION attribute. The value works exactly as above, except that content matching the lookahead configuration never expires.

<ICEXPIRE HOST = www.cnn.com
REGEXP = "http://www.cnn.com/sports/.*"
EXPIRATION = NEVER>

3.4.7 Minimum expirations

Expiration dates computed with the automatic expiration algorithm described in Section 4.1 Automatic Expiration Control can sometimes yield non-intuitive results. If, for example, an object is retrieved immediately after it is modified, and the object does not have enough lifetime samples, the resulting new expiration date will be very short, causing the object to expire earlier than it should.

Content providers can override an expiration date calculated by the algorithm with a minimum value. If the expiration date computed by the algorithm is below the minimum, the minimum is used. The minimum expiration is defined in the MIN_EXPIRATION attribute of the ICEXPIRE HTML tag. It can be specified as a time in seconds relative to the time it was retrieved, or as an absolute date in standard HTTP format.

<ICEXPIRE HOST = www.cnn.com
REGEXP = "http://www.cnn.com/sports/.*"
MIN_EXPIRATION = "+600">

In this example, all matching content has its expiration 
calculated using the automatic expiration algorithm, but if 
that result is less than 600 seconds from now, the expiration 
is set to 600 seconds from now. 

<ICEXPIRE HOST = www.cnn.com
REGEXP = "http://www.cnn.com/sports/.*"
MIN_EXPIRATION = "Sun, 2 Feb 1997 10:10:10 GMT">

In the second example, the behavior is identical, but the 
minimum expiration is a specific time in the future. 
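
The two-level ICEXPIRE lookup (host name as hash key, then regular expressions for that host) and the resolution of the EXPIRATION and MIN_EXPIRATION values can be sketched as follows. This is a sketch under stated assumptions rather than the specified implementation: the `ExpireRules` class and the `to_date` and `resolve` helpers are hypothetical names, the automatically computed date is passed in as a plain argument, and ALWAYS is modeled as an expiration equal to the retrieval time so that every later request triggers validation.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

NEVER = datetime.max.replace(tzinfo=timezone.utc)


class ExpireRules:
    """Host-keyed table of ICEXPIRE rules (hypothetical structure)."""

    def __init__(self):
        self._by_host = {}   # host -> [(compiled regex, expiration, min_expiration)]

    def add(self, host, regexp, expiration=None, min_expiration=None):
        rule = (re.compile(regexp), expiration, min_expiration)
        self._by_host.setdefault(host, []).append(rule)

    def lookup(self, host, url):
        # First level: hash lookup on the host. Second level: regex scan.
        for pattern, expiration, min_expiration in self._by_host.get(host, ()):
            if pattern.match(url):
                return expiration, min_expiration
        return None


def to_date(value, retrieved):
    """Interpret '+seconds' relative to retrieval, else an HTTP-format date."""
    if value.startswith("+"):
        return retrieved + timedelta(seconds=int(value[1:]))
    return parsedate_to_datetime(value)


def resolve(rule, retrieved, automatic):
    """Return the expiration date for content matched by `rule`."""
    expiration, min_expiration = rule
    if expiration == "ALWAYS":
        return retrieved                 # assumption: expired at once, always revalidate
    if expiration == "NEVER":
        return NEVER
    if expiration is not None:
        return to_date(expiration, retrieved)
    if min_expiration is not None:
        # The minimum only overrides an automatic date that falls below it.
        return max(automatic, to_date(min_expiration, retrieved))
    return automatic
```

Keeping the host lookup in a dictionary means the comparatively slow regular expression scan only runs over the rules registered for that one host.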
3.5 Lookahead 

3.5.1 Overview 

One of the major areas where the caching server adds 
value is in lookahead (pre-fetch of content). Most caching is 
based on past usage of the network; the user visits a site and 
their web browser stores that site's content for a set period 
of time. If the site is re-visited, the content is fetched locally, 
rather than over the network. 

According to one embodiment, lookahead caching uses 
predictive algorithms to determine where a user may go 
given their current location. Lookahead caching then 
attempts to fetch the desired content before the user actually 
travels to the new location. Thus, when the user actually 
travels along a web link to a new page, that page is already 
present locally and can be displayed very quickly. 

A number of different mechanisms allow content provid- 
ers to tune the lookahead algorithm with a great degree of 
flexibility for their particular site layout. According to one 
embodiment, the lookahead algorithm runs on the caching 
server. The lookahead algorithm uses tuning information 
created by content providers and maintained by a publisher's 
back-end server. Statistics gathering mechanisms are inte- 
grated with lookahead tuning, creating a feedback cycle 
where usage of a site causes statistic uploads to the back-end 
server, which then automatically aggregates and updates the 
tuning information and downloads the result to all subscriber 
caching servers. FIG. 5 is a flow chart illustrating an 
overview of statistics gathering according to one embodi- 
ment. FIG. 6 is a flow chart illustrating an overview of one 
embodiment of lookahead caching. 

3.5.2 Terminology 

Following are some terms used in the next sections: 
Initial page 

Whenever the user requests a page from their web 
browser, that page is looked ahead upon. That user- 
requested page is known as the "initial page". 

Child page 

A page reachable via URL from a "parent page". The 
lookahead algorithm works by analyzing child links of the 
initial page, and then recursing on the pages pointed to by 
each child link. 

Parent page 
An HTML page whose children are analyzed by the
lookahead algorithm.

Lookahead Level

Also known as "lookahead depth". The number of links
between the initial page and the current page. A lookahead
level of 1 includes all child pages of the initial page, together
with their inline images and applets. A lookahead level of 2
includes the child pages of each child page of the initial
page, together with their inline images. Repeat for levels 3
and 4. Levels of 3 and above include an enormous number
of pages.

Positive filter

A regular expression applied to child link URLs. If the
URL matches the expression, the lookahead algorithm continues
at that point, otherwise it stops.

Negative filter

As above, but lookahead continues only if the URL does
not match the regular expression.

Weight

An arbitrary number assigned to a child link, representing
the probability relative to its siblings that it will be traversed
by the user.

Score

The relative importance of one lookahead request relative
to all other requests currently queued. This includes requests
made from other browser windows, or by other components
of the system.

3.5.3 Algorithm Overview 

According to one embodiment, the caching server's lookahead
algorithm starts by assigning a score to an initial
page. When the page arrives, the caching server scans the
page for child links to other pages. The caching server then
assigns each of these child pages a "weight," or likelihood
that the child pages may be traveled along by the user. The
weights are numbers that represent the likelihood relative to
other pages on the site that that page may be accessed. The
weight is the content provider's opinion of the page's access
likelihood relative to other pages. According to one
embodiment, the user's actual browsing behavior is not
taken into account in determining the weight.

According to an alternate embodiment, the user's browsing
behavior is also examined. Both weight and browsing
behavior are then fed into an algorithm, together with the
parent page's score. That results in a child page score. The
caching server then queues requests for all the child pages in
descending score order and begins fetching them. The
algorithm has the property that the child link scores, when
added together, result in the parent page's score. The scoring
algorithm is described in detail in the next section.

According to one embodiment, lookahead processing is
recursive. As soon as a lookahead request completes, the
retrieved page is analyzed for links and the same scoring
algorithm is executed over those links to yield a set of
probabilities that they will be traveled along if the parent
page is reached. Because a given page's links' scores all sum
to the parent's score, this embodiment of the algorithm has
the desirable behavior that all lookahead requests generated
from a given page have scores that preserve the likelihood
relative to one another that the page will be selected by the
user. The algorithm also has the desirable behavior that it
converges automatically. Each page's score gets smaller the
farther it is from the original page. Eventually it reaches
zero, at a rate inversely proportional to its weight and the
weights of its parents. The higher a link's weight, the larger
the percentage of its parent's score it will receive and propagate
to its children.
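
The recursive, score-ordered behavior described above can be sketched with a priority queue. The sketch covers only the embodiment in which the user's browsing behavior is not taken into account, so each child receives its parent's score scaled by its share of the sibling weights. The `lookahead` function, the `links` and `weights` dictionaries, and the `min_score` cutoff that stands in for "eventually it reaches zero" are all assumptions made for illustration.

```python
import heapq


def lookahead(initial, links, weights, initial_score=1.0, min_score=0.01):
    """Return (page, score) pairs in the order the pages would be fetched.

    links:   page -> list of child pages (hypothetical stand-in for HTML parsing)
    weights: page -> content-provider weight (defaults to 1 when absent)
    """
    order = []
    seen = {initial}
    queue = [(-initial_score, initial)]          # max-heap via negated scores
    while queue:
        neg_score, page = heapq.heappop(queue)   # highest score is fetched first
        score = -neg_score
        order.append((page, score))
        children = [c for c in links.get(page, []) if c not in seen]
        total = sum(weights.get(c, 1) for c in children)
        for child in children:
            # Each child gets its weighted share, so sibling scores sum to the parent's.
            child_score = score * weights.get(child, 1) / total
            if child_score >= min_score:         # assumed cutoff: stop once negligible
                seen.add(child)
                heapq.heappush(queue, (-child_score, child))
    return order
```

Because every level divides a parent's score among its children, scores shrink with distance from the initial page and the queue empties on its own.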

The algorithm is halted as soon as the user requests a new 
page. The algorithm is then started from the new initial page. 
Any existing lookahead requests remain queued, but their scores are set to zero. If they appear in the new run of the lookahead algorithm, then their scores will be assigned as appropriate, relative to the user's new position. Once the new run completes, any leftover requests with zero scores are eliminated, since it becomes highly unlikely based on the algorithm that the user will want those pages. This behavior has the desirable property that the number of queued lookahead requests remains bounded and does not consume too many caching server resources.

Many instances of the algorithm can run in parallel. The caching server can, for example, run an instance of the algorithm for each browser window that the user has active. If the caching server runs as a network service used by multiple users each with their own browsers, then the caching server can run an instance of the algorithm for each browser. Furthermore, the caching server can run an instance of the lookahead algorithm for each subscription (see following) that the user has created.

Each algorithm instance runs in parallel with the other algorithms, and can have its own configuration. The scores resulting from each algorithm's analysis are all absolute, meaning that the score assigned the initial request is the only way for one instance's resulting lookahead requests to be more important than any other instance's requests.

3.5.4 Child Scoring Algorithm

According to one embodiment, a child link's score has three inputs: the user's past browsing behavior on that link's page, the content provider's weight for that child page relative to the other pages on the site, and the parent's score. Inline images, applets, and certain other objects are always assigned their parent's weight. Objects of these types are always automatically requested by the browser whenever the browser sees their links, so by definition they are as likely to be accessed as their parent, hence getting their parent's weight.

HTML pages are more difficult to score. The algorithm must pay most of its attention to the content-provider-assigned page weight until the user's browsing behavior becomes statistically valid. In the absence of any user input, the content provider clearly has an idea of how their content is used, particularly when their weights are determined via a statistic gathering mechanism (described later). The algorithm assigns the right proportion of user behavior and content-provider-assigned weight to the score so that the more the user visits the site, the more his or her input is valued.

The algorithm also weighs user behavior such that a user need only visit a small site (site with a small number of pages) a small number of times to get a statistically valid surfing sample. A user must correspondingly visit a large site (site with a large number of pages) a large number of times to get the same statistically valid sample. For the lookahead algorithm to correctly represent probabilities, all child page scores on a parent page must sum to the parent page's score.

According to one embodiment, the algorithm is logarithmic to allow the user's behavior to overwhelm the content provider's weighting as soon as statistically possible, then level off as the user visits the site more and more, so that there is always a minimum effect from the content-provider's weights. A piece-wise linear approximation may also be utilized.

According to one embodiment of the present invention, the algorithm is as follows:

p = parentScore * [ ( M1(n, nHits) * (nHits / totalHits) ) +
                    ( M2(n, nHits) * (weight / totalWeight) ) ]
if hits <= 3n, then
    M1(n, nHits) is: hits / 4n
    M2(n, nHits) is: (4n - hits) / 4n
if hits > 3n, then
    M1(n, nHits) is: hits / (hits + n)
    M2(n, nHits) is: n / (hits + n)

nHits is the number of times the child page has been accessed by the user. TotalHits is the number of times that all pages accessible from the parent page have been accessed by the user. Weight is the child page's content-provider-assigned page weight. TotalWeight is the total of the weights of all pages accessible from the parent page. N represents the size of the site and is described in detail below. Functions M1 and M2 are both scaled by percentages, M1 by this child's percentage of all hits on the parent page's children, and M2 by the percentage that this child's weight contributes to the total weight for the parent page. The two scaled terms of each child page will therefore sum to the parent's total, giving the desired behavior that the sum of all child scores is the parent score.

A piece-wise linear approximation of logarithmic behavior is obtained by dividing the scoring algorithm into two areas, one used when the number of hits is less than 3N and one when the number of hits is greater than 3N. The "knee" in the approximated curve occurs when the number of hits is exactly 3N. Variable N is used instead of a constant number of hits to parameterise the knee in the curve by the size of the site. The user is not granted a statistically valid browsing sample until the user has accessed a certain percentage of a site. Otherwise the user's browsing activity would be weighted too heavily too quickly (a threshold suited to a small site is inadequate on a large site) or too slowly (a threshold suited to a large site is inadequate on a small site).

The two functions M1 and M2 are complementary. M1 (the user-behavior term) gets larger and M2 (the content-provider term) gets smaller the more the user visits the site. The M1 value increases and the M2 value decreases rapidly in the first part of the algorithm, where the user's hits on the site number fewer than 75% of the total number of links (3N = 3 * (number of links / 4)). In a site with 28 links, for example, after 7 hits, the user behavior is weighted 25% and the content-provider 75%. After 14 hits, the relative percentages are 50/50, and after 21 hits, 75/25.

After that, the algorithm drops into a slower mode, in which the user-derived behavior increases more slowly relative to the content-provider-derived behavior. In the above example, by the time the percentages are 80/20, the number of hits has increased by 50% but the percentage of user-derived behavior has only risen 5%.

In the beginning, almost all of the child's score comes from its weight over the weight total. As the number of hits increases, more and more of the score comes from the number of hits, until finally it levels off at a contribution of 75% to the total score.

3.5.5 Default Weight Assignment

Proper assignment of link weights is crucial to correct lookahead behavior. The caching server can automatically assign weights to each link. Given that the caching server knows nothing about the site's semantics, however, there is no way for the caching server to do more than make basic assumptions about link placement and use those assumptions to assign weights. By comparing the link placement to a number of different stored profiles, the caching server can apply one of a suite of link assignment algorithms to the page.

Alternatively, according to one embodiment of the present 
invention, weights can be assigned in exponentially decreas- 
ing magnitude to links as the links are encountered on the 
page. This embodiment is based on the assumption that links 
near the start of the page are more likely to be traversed than 
links farther down the page. The algorithm divides links into 
two categories: links on the same site as the parent page, and 
links on a different site. The algorithm gives higher weights 
to links in the same site. The algorithm looks for links in the 
order encountered in the parent page's HTML stream. The 
first N links are assigned the maximum weight W. After each
N links, the current maximum weight W is lowered by a
scaling factor S. This process continues until W reaches a
minimum value. According to one embodiment, W is 256, N
is 4, and S is 0.5. Thus the first four links encountered are
assigned a weight of 256, the next four get 128, the next four
get 64, and so on down to 1. Other values of W, N and S may
also be utilized.

Links outside the parent page's site are assigned a reduced
weight, under the assumption that the user is less likely to
stray from the site than stay in the site. Each off-site link is
assigned a weight which is the current maximum weight W
reduced by a reduction factor R. The off-site links do not
count toward N, i.e. the off-site links never cause W to be
lowered. According to one embodiment, R is 0.25. Thus
every off-site link encountered during the first 4 on-site links
gets a weight of 64, then 32 if encountered during the next
4 on-site links, and so on down to 1.
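
The default weighting described above can be sketched as follows. This is a minimal illustration only: the function name, the parameter names, the use of absolute URLs, and the clamping of off-site weights to the minimum value are assumptions, not details given in this description.

```python
from urllib.parse import urlsplit


def assign_default_weights(parent_url, links, w_max=256, n=4, s=0.5, r=0.25, w_min=1):
    """Return (url, weight) pairs for links in the order they appear on the page.

    Assumes every link is an absolute URL; a link whose host differs from the
    parent page's host is treated as off-site.
    """
    parent_host = urlsplit(parent_url).hostname
    current = w_max      # current maximum weight W
    on_site_seen = 0     # on-site links assigned at the current W
    weights = []
    for url in links:
        if urlsplit(url).hostname == parent_host:
            weights.append((url, current))
            on_site_seen += 1
            if on_site_seen == n:
                # After each N on-site links, lower W by the scaling factor S.
                current = max(w_min, current * s)
                on_site_seen = 0
        else:
            # Off-site links get W reduced by R and never count toward N.
            # Clamping to w_min is an assumption made for this sketch.
            weights.append((url, max(w_min, current * r)))
    return weights
```

With the example values W = 256, N = 4, S = 0.5 and R = 0.25, an off-site link seen before the fifth on-site link receives 64 or 32 depending on whether the fourth on-site link has already been passed.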

3.5.6 Custom Weight Assignment

According to one embodiment, in situations where the
default weighting algorithm is inappropriate, the system 
allows the content provider to specify weights for each page 
on the site. The weights are stored in a TOC, described in 
Section 3.1, Bulk Validation. Each URL defined by a TOC 
ICPAGE tag can have its own custom weight, in addition to 
the standard bulk-validation information. 

Each ICPAGE tag in a TOC contains, in addition to the 
member URL name and its validation information, a 
content-provider-supplied weight: 



<ICPAGE URL=http://trutkincommoucom/library/achannel.shtml
WEIGHT=1234 LAST/MOD=84762964>



That weight is then used in place of any weight calculated
by the caching server's default weight assignment algorithm.
The TOC can initially be generated automatically by
the back-end server, with default weights assigned via the
algorithm described in the previous section. As caching
servers run on many client machines, they upload their usage
statistics to a central collection point. Each caching server
uploads a version of its TOC periodically, the exact frequency
being defined by the content provider as part of the
TOC data. The uploaded information contains each TOC
member URL and the number of times the uploading caching
server has accessed the particular member.

The result of the uploads is an exact picture of the site's 
usage patterns, on a per-URL basis, automatically grouped 
by TOC. In addition to being valuable site organization data, 
the hit counts can be aggregated manually, automatically, or
a combination of both, and fed back into a new version of 
the TOC, which in turn is downloaded to the caching servers 



as their copies of the TOC expire. This feedback cycle 
automatically tunes the lookahead algorithm on a per-TOC 
basis, exactly in accordance with actual usage patterns. 
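
The automatic aggregation step of this feedback cycle might look like the sketch below. It only sums the uploaded per-URL hit counts; how the totals are turned into new TOC weights is not specified here, and the function name and the dictionary-based upload format are assumptions made for illustration.

```python
from collections import Counter


def aggregate_hit_counts(uploads):
    """Sum per-URL hit counts across the uploads from many caching servers.

    Each upload is assumed to be a mapping from TOC member URL to the number
    of times the uploading caching server accessed that member.
    """
    totals = Counter()
    for upload in uploads:
        totals.update(upload)  # Counter.update adds counts rather than replacing them
    return dict(totals)
```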

3.5.7 Other Lookahead Configuration 

While weight calculation and page scoring are the foun- 
dation of successful site lookahead, the lookahead algorithm 
can also be configured along several other axes. Moreover, 
the algorithm can be configured separately for different sites, 
even for different subscriptions in the same site. Each 
configuration is described by a piece of meta-data called an 
ICLOOKAHEAD tag. That tag can be placed anywhere, and 
typically appears in one of two places. According to one 
embodiment of the present invention, the tag can be placed 
in a site meta-data page along with other meta-data, such as 
pointers to TOC information, or regular-expression-based 
ICEXPIRES expiration tags. According to another 
embodiment, the tag can be transmitted as MIME data in 
channel subscription notification data. In both cases, the 
configuration itself is identical but retrieved differently. 

The lookahead configuration tag controls the following 
algorithm parameters in addition to the custom expiration 
data described earlier in this document: 

maximum depth 

maximum number of links 

pruning regular expressions 

pruning file types 

lookahead off site or not 

lookahead out of TOC or not 

lookahead on images, audio, or video 

lookahead for a maximum amount of time 

lookahead to a maximum amount of data 

The configuration scheme is HTML and is therefore easily
extensible. Following is an example of the ICLOOKAHEAD
tag and its attributes:

<ICLOOKAHEAD DOMAIN_NAME=cnn.com
HOST_REGEXP=.*.cnn.com
MAX_DEPTH=2 MAX_LINKS=50 NO_GRAPHICS=FALSE
NO_OFFSITE=TRUE
PROCEED_IF_MATCH=".*/topstories/*.html">

In addition to lookahead configurations that are bound to 
channel subscriptions, the content provider can have any 
number of lookahead configurations bound to site (host) 
name regular expressions. According to one embodiment, in
order to improve performance, the caching server uses a
two-stage lookup mechanism similar to that used by ICEXPIRES
tags. In this case the first stage is the host's "domain",
i.e. the last two labels of the host name. The domain is stored
in a hash table and can be looked up quickly. Whenever a 
page is looked ahead on, its URL's host name's domain is 
looked up in the hash table. If an entry is found, all 
lookahead configurations for that domain have their host 
name regular expressions compared against the URL's host 
name. The configuration whose host name regular expres- 
sion first matches the URL's host name is used to configure 
lookahead for that URL. The two-stage lookup algorithm 
thus ensures that domains with no custom lookahead are not 
slowed by domains with lots of custom lookahead. 
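
The two-stage lookup can be sketched as a hash table keyed by domain, with each entry holding an ordered list of host-name regular expressions. The class and method names are hypothetical, Python's `re` syntax stands in for whatever regular-expression language is actually used, and requiring the expression to match the whole host name is an assumption.

```python
import re
from urllib.parse import urlsplit


class LookaheadConfigTable:
    """Stage 1: hash lookup by domain. Stage 2: first matching host regexp wins."""

    def __init__(self):
        self._by_domain = {}  # domain -> list of (compiled host regexp, config)

    @staticmethod
    def domain_of(host):
        # The "domain" is the last two labels of the host name.
        return ".".join(host.lower().split(".")[-2:])

    def add(self, domain, host_regexp, config):
        entry = (re.compile(host_regexp), config)
        self._by_domain.setdefault(domain.lower(), []).append(entry)

    def lookup(self, url):
        host = urlsplit(url).hostname
        if host is None:
            return None
        # Domains with no custom lookahead cost only one dictionary lookup.
        for pattern, config in self._by_domain.get(self.domain_of(host), []):
            if pattern.fullmatch(host):
                return config
        return None
```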

3.5.8 Depth Configuration 

The maximum depth parameter controls how many levels 
of links are chased by the lookahead algorithm. Level one 
lookahead consists of the current page's links and all of their 
inline images. Level two lookahead consists of level one 
lookahead plus each child page's links and all of their inline 
images. Levels three and four extend the algorithm further.
Traditional "spidering" algorithms chase all links to a
specific level. The result is an explosion of requests at levels
above two, rendering the spidering almost useless unless the
user has high-speed network access and lots of time on their
hands.

One embodiment of the present invention uses a combination
of level and lookahead score to regulate lookahead
and keep it effective. Any potential lookahead request below
level one must have a score above a minimum cutoff in order
to be considered. The cutoff score is 1/50 the original page's
score. Thus as the lookahead algorithm gets farther and
farther removed from the original page, the scores it derives
get smaller and smaller, until they fall below the cutoff.
Scores do not drop uniformly but rather according to the
relative importance of the various links traversed. If one link
on a page has 75% of the total score on that page, the other
links at its level may all have scores too low to allow
lookahead to continue through them. The page with the large
score may, however, have a score large enough for it and its
children to be looked ahead on. This depends on the page
scores, which in turn depend on user behavior and
content-provider-driven behavior. The merging of cutoff score, the
scoring algorithm, and level control gives the content provider
exact control over how lookahead is performed. The
result is far more accurate than traditional spidering.

3.5.9 Link Count Configuration

Link counts are the second parameter to lookahead configuration.
The content provider can control the maximum
number of links looked ahead on the initial page. The link
count applies only to the initial page's links. Their child
links are looked ahead according to cutoff score, as detailed
in the algorithm description previously.

Link count creates flexibility for content providers.
Normally, all links off the initial page are looked ahead on.
Lookahead score is used only to prioritize the requests. If the
initial page has a large number of links, however, the
algorithm may spend too much time looking ahead on links
unlikely to be traversed by the user. By restricting the link
count to a smaller number, the lookahead algorithm can
complete more quickly, and spend its time analyzing subsequent
pages.

3.5.10 Lookahead on Images, Audio, Video

According to one embodiment, the lookahead algorithm
can be further tuned by being configured not to look ahead
on images, audio content, or video content. These types of
content are typically much larger than HTML pages, and
therefore take longer to download. In the time taken to
download a single JPEG image, for example, the server
could download ten or fifteen HTML pages. In the time
taken to download a single WAV file (audio file) tens of
HTML pages could be loaded. The savings is even greater
for video content.

Users generally want images. Content providers also
generally wish images downloaded, particularly if the images
are advertisements. Occasionally, however, site images are
not important relative to the text on the page. In those cases,
the content provider may want to disable lookahead on the
images and download the text content much more rapidly.
Audio and video, for example, tend to be much less important
to the look and feel of a page, and may therefore
typically be disabled. In addition, image, audio, and video
lookahead can be tuned so that rather than either happening
at parent priority or not at all, they happen below parent
priority, possibly after all text lookahead has completed.

3.5.11 Pruning Regular Expressions

The content provider can create two regular expressions
in each lookahead configuration, as illustrated in FIGS.
7A-B. One expression is used as a "positive filter" for links
on a page. For a page link to be considered for lookahead,
its URL must match the positive-filter expression, as illustrated
in FIG. 7A. The regular expression syntax used is not
important, as long as it provides functionality to match a
reasonable variety of URLs without too much work by the
content provider. In the following examples, the regular
expression language is that used by the GNU Emacs text
editor. Examples of other equally usable languages are those
used by the POSIX Unix standard, Microsoft Developer
Studio, the Unix "egrep" program, or the Epsilon text editor.

The second regular expression is used as a "negative
filter" for links on a page, as illustrated in FIG. 7B. For a
page link to be considered for lookahead, its URL must not
match the negative-filter regular expression. This type of
regular expression is useful for screening out certain types of
links that the content provider wants never to be looked
ahead on. Typical candidates are executable files, full-motion
video, or sound files.

3.5.12 Lookahead Off Site

The lookahead off-site parameter controls whether lookahead
follows off-site links. Off-site links are defined to be those
links whose URL host component is different from the host
component of their parent page. Content providers may set this
value to "no lookahead off-site" if they do not wish the caching
server to expend resources such as network access time,
processing time, or disk storage looking ahead for pages not
owned by the content provider.

3.5.13 Default Lookahead Configurations

Just as the caching server has a default link weight
computation algorithm for use in cases where the content
provider does not provide weights in their TOC
meta-data, the caching server also has default lookahead
configurations for cases where the content provider has no
applicable ICLOOKAHEAD information in their site meta-data,
or no lookahead configuration is provided with a
subscription channel notification. According to one
embodiment, the caching server uses two similar default
lookahead configurations, one for browser-based lookahead
and one for channel-subscription-based lookahead. Additional
configurations can easily be stored, and the existing
configurations may also be modified by the end user.

The default configuration currently used by the caching
server for browser-based lookahead has the following settings:

Maximum depth 1 (initial page's children and their inline
images)

Maximum links 50 (the first 50 links encountered on the
initial page)

Lookahead off-site

No positive-filter regular expression

Lookahead on images, but neither audio nor video

Negative-filter regular expression to remove executable
files, server-side image maps, and binary data files

3.6 Subscriptions and Notification

Once a user subscribes to content from a publisher, that
content must be delivered to the user's desktop. This section
describes the process by which a user subscribes to content,
and it describes the system that pushes notification data from
the publisher to the user's desktop.

Creating a subscription is intended to be a simple, lightweight
process, for example a single click of a button on a
web page by the user. Similarly, deleting a subscription is a
simple operation performed on the content bar by the user.
Subscribing to content begins the flow of notifications from
the content provider to the user. Unsubscribing terminates
the flow of notifications, guaranteeing that the user sees no
more information from that subscription.

The subscription process is also intended to be highly 
configurable. There are a number of different notification 
mechanisms available to the publisher, and each is appro- 
priate in different situations. The notification system allows 
the content provider and the client to negotiate notification 
mechanisms, and further allows restrictions by intranet 
administrators, so that clients on private networks operate 
under rules defined by the network administrator no matter 
what the information publisher wants. 

3.6.1 Service Requirements

The following describes some of the different areas that a 
notification service must address: 
shared data 
personalized data 
reliable delivery 
confirmed delivery 
store-and-forward delivery 
intranets and firewalls
security 
anonymity 

user control over subscriptions 

3.6.1.1 Shared Data 

According to one embodiment, the notification system 
must be able to handle effectively data shared by a large user 
community. Given that the data is shared, notifying sub- 
scribers of its presence is most effectively performed by a 
multicast protocol. Multicast protocols save network 
bandwidth, improve origin server performance by sending 
only a single copy of the data, and keep the origin server 
from having to maintain subscriber lists (although such lists 
may be maintained for other reasons). 

3.6.1.2 Personalized Data 

At the other end of the spectrum is highly personalized 
data, such as stock portfolio updates and personalized news- 
papers. The network overhead of maintaining multicast 
groups in this instance is wasted because there is only ever 
one recipient of the data. Instead, the system must be able to 
unicast notification data, or at least an indicator that such 
data is available. 

3.6.1.3 Reliable Delivery 

Publishers originating notification data need to know that 
their subscribers will receive the data. "Reliable" in this 
context is fairly basic, on the order of email-based reliability. 
The system must guarantee that the data arrive at all sub- 
scribers "eventually", with the publisher having at least 
some control over the maximum time beyond which it 
knows that 99% of the subscribers have received the data. 

3.6.1.4 Confirmed Delivery 

Confirmed delivery takes reliable delivery one step farther.
The publisher not only needs to know that its data will
eventually be received by all subscribers, it also needs to
know which subscribers received the data and at what time.
Such a system requires subscriber lists, with individual 
subscribers contacting the publisher on receipt. This type of 
return-receipt-request may have an impact on the perfor- 
mance of the system. 

3.6.1.5 Store and Forward Delivery 

A class of subscriber machine that is not always con- 
nected to the network includes laptops that dock with a 
networked station periodically, or a machine that dials into 
the network periodically to pick up information. Another 
class of machines uses DHCP (Dynamic Host Configuration
Protocol) for dynamic IP address management. Such
machines may overlap with the previous class of machines, 
but also include desktop machines permanently connected to 
the network but whose addresses are managed dynamically. 

Both classes of machines can benefit from using a proxy 
notification server that is responsible for handling incoming 
notifications and buffering them for the user. If such a 
service is not available, frequently-disconnected machines 
will be forced to poll, since they cannot count on receiving 
the notifications. DHCP machines are forced to poll if they
operate in a unicast-only environment, because unicast 
requires address lists, and it is not possible to maintain 
address lists effectively if the addresses change constantly. 
Using host names for address transparency does not work 
either, because many of the machines do not have names. 

3.6.1.6 Intranets and Firewalls

The notification system must be general enough to per- 
form well in an intranet context as well as in an Internet 
context. One obvious problem with use on the Internet is that 
the publishers and their subscribers will frequently be on 
opposite sides of a firewall. Firewalls are frequently con- 
figured to let requests out into the Internet, but to bar 
unsolicited information other than email from travelling into 
the intranet. 

The notification system needs to function reasonably well 
in a firewall environment that behaves in this manner. The 
notification system also needs to offer notification function- 
ality that is simple enough that network administrators can 
scope any security issues easily. The fewer the security 
concerns, the more likely notifications may be allowed 
through a firewall by network administrators who believe 
the benefits of asynchronous notifications in terms of net- 
work bandwidth savings make it worthwhile to reconfigure 
their firewall software. 

3.6.1.7 Security 

The notification system uses existing security infrastruc- 
ture to give subscribers assurance that incoming notifica- 
tions are indeed from the desired publisher, and not from a 
malicious third party. In addition, notification data is 
encryptable.

3.6.1.8 Anonymity 

Subscribers may wish to remain anonymous from pub- 
lishers. The notification system must be able to provide a 
level of indirection between publishers and subscribers that 
implements anonymity. A multicast notification system by 
itself does not guarantee anonymity. Instead, the system 
needs to use proxy notification servers that act on behalf of 
clients wishing to remain anonymous.

3.6.1.9 User Control over Subscriptions

Once a user creates a subscription, he or she must be able 
to remove that subscription and know that the system will 
immediately stop accepting notifications for that subscrip- 
tion. This gives the user fine-grain control over the types of 
information he or she receives and does not allow the 
provider undue privilege. This is a key solution to a major 
problem with electronic mail: unsolicited email. Email, as a
notification mechanism, operates at too gross an addressing
level. By giving one's electronic mail address to a publisher, 
the user loses the ability to screen out future unwanted 
information from that publisher, and has no control if that 
publisher passes the email address to someone else. The key 
to solving this problem is the scheme of registration and 
subscription where the user retains control of whether to 
accept or reject information on a fine-grain subscription 
level. 

3.6.2 Notification Services 

The following section describes in detail the different
components of the notification system and how the components
implement the various notification requirements.

3.6.2.1 Components

The system comprises the following components: 
Client drivers, one per notification mechanism. 
Unreliable ping protocol, either unicast or multicast 
Unreliable notification protocol, intended for multicast 
Synchronous request algorithm 
Return-receipt support 
Backup polling algorithm 
Notification proxy server 
Subscriber list management 

Subscription meta-data describing parameters in section 
3.2 Meta Data. 

3.6.2.2 Client Drivers 

According to one embodiment, notification services are 
implemented as loadable "drivers", each implementing a 
common service interface. A partial list of operations fol- 
lows: 

start 

stop 

subscribe 
unsubscribe 
show-configuration 
configure 

According to one embodiment, in order to implement a 
new notification mechanism, the standard driver interface is 
implemented. A common notification system manager 
handles all generic tasks, such as subscription validation, 
driver management, and delivery of information to the 
caching server. 
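
A common driver interface of this kind could be expressed as an abstract base class. The sketch below is illustrative only: the class names, the method signatures, and the toy `RecordingDriver` are assumptions, and only the six operation names come from the partial list above.

```python
from abc import ABC, abstractmethod


class NotificationDriver(ABC):
    """Common service interface; one concrete driver per notification mechanism."""

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def stop(self): ...

    @abstractmethod
    def subscribe(self, subscription_id): ...

    @abstractmethod
    def unsubscribe(self, subscription_id): ...

    @abstractmethod
    def show_configuration(self): ...

    @abstractmethod
    def configure(self, **settings): ...


class RecordingDriver(NotificationDriver):
    """Toy driver that only records state, to show the interface in use."""

    def __init__(self):
        self.running = False
        self.subscriptions = set()
        self.settings = {}

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def subscribe(self, subscription_id):
        self.subscriptions.add(subscription_id)

    def unsubscribe(self, subscription_id):
        self.subscriptions.discard(subscription_id)

    def show_configuration(self):
        return dict(self.settings)

    def configure(self, **settings):
        self.settings.update(settings)
```

A new notification mechanism would then be added by writing another subclass, leaving the generic tasks to the common manager.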

3.6.2.3 Reliability and Notification 

The system does not attempt to notify using a reliable 
transport protocol. As far as the user is concerned, the high
level notification process provides reliable delivery, but the 
system implements reliable delivery by using a combination 
of lightweight unreliable asynchronous notification, syn- 
chronous requests, and backup polling. 

There are a number of problems with reliable transport. In 
the multicast world, reliability is difficult to implement well. 
Protocols like RMTP, for example, deal with various aspects 
of reliability, but at the expense of complexity. In the unicast 
world, reliable transport via TCP is easy to implement, but 
does not provide any bandwidth or server performance 
savings over unreliable notification followed by synchro- 
nous requests. In fact, the latter mechanism can provide 
higher performance than reliable unicast if caching servers 
are used (see following). 

According to one embodiment of the present invention, 
asynchronous notification is implemented by providing an 
unreliable multicast atop IP multicast, and a very simple 
unicast "ping" protocol. Unreliable multicast over LANs 
will end up being reliable most of the time without requiring 
all the additional protocol complexity. Unicast will simply 
provide ping functionality, since transmitting the data itself 
to all recipients takes longer than asking the recipients to ask 
for the data. 

3.6.2.4 Unreliable Ping Protocol 

In the unicast world, asynchronous notification by trans- 
mitting the entire notification is not practical. The publisher 
becomes responsible for sending a copy of the data to every 
subscriber, which is no different from the subscribers asking 
for it, providing the subscribers never ask unless there is new 
data available. 

The "ping" protocol implements a means for the publisher 
to notify subscribers that new data is available for them to 
retrieve. This protocol immediately improves performance 
over simple polling because subscribers only ask for data 
when new data is available. The process is analogous to the
post office leaving a user a note that there is a package
waiting to be picked up. The user does not have to drive to 
the post office every day, but rather only when a note tells the 
user that a package is waiting. 

Each subscriber thus needs to request data from the server.
In the case where the information is shared and public,
whenever a subscriber receives a ping, it waits a random
amount of time before requesting the information. The first
subscriber on a network segment requests the data of a
caching proxy server or notification proxy. That entity then
requests the data of the publisher. The random wait prevents
all subscribers from asking at once, and increases substantially
the likelihood that they can get the data from a cache
instead of directly from the publisher, thus reducing server
load. Even if there are no intermediaries between the subscriber
and the publisher, the random wait distributes the
load at the publisher.
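
The random wait can be illustrated with a few lines of code. The uniform distribution, the upper bound parameter, and the function name are assumptions; the description only says that the wait is random.

```python
import random


def random_wait_seconds(max_wait_seconds, rng=random):
    """Pick how long a subscriber waits after a ping before requesting the data.

    Drawing uniformly from [0, max_wait_seconds] is an assumption of this sketch.
    """
    return rng.uniform(0.0, max_wait_seconds)
```

Spreading requests over the interval means the first requester on a segment is likely to populate the cache before the others ask.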

If the data is not shared, then each subscriber does have
to request the information from the publisher. But the
overhead of transporting the information from publisher to
subscriber would still have to happen once per subscriber.
Multicasting personalized information does not render a
benefit. Having the subscriber request the information is
better than the reverse because the mechanisms already
exist, they pass through firewalls, and they do not require
additional store and forward infrastructure at the publisher.

The ping protocol is inherently unreliable, thus requiring
a mechanism to deal with lost pings. Sequence numbers
incremented by one may be used for each notification sent
by the publisher. A subscriber that sees a hole in the
sequence space simply asks for the missing notification(s).
This mechanism is only necessary if the notifications comprise
a stream of data, all elements of which must be
received by the subscriber. If new notifications subsume
older ones, then the sequence scheme does not need to be
used.
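
Detecting a hole in the sequence space can be sketched as below. The class name is hypothetical, and treating a sequence number at or below the last one seen as a duplicate or late arrival is an assumption not stated in this description.

```python
class SequenceTracker:
    """Track notification sequence numbers for one subscription."""

    def __init__(self):
        self.last_seen = None

    def observe(self, seq):
        """Record seq and return the sequence numbers skipped before it."""
        if self.last_seen is None:
            self.last_seen = seq
            return []
        if seq <= self.last_seen:
            return []  # assumed duplicate or late arrival; nothing new is missing
        missing = list(range(self.last_seen + 1, seq))
        self.last_seen = seq
        return missing
```

The subscriber would then ask the publisher for each number in the returned list.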

Whether or not sequencing is used, the system also has to
handle situations where no notifications arrive for "too
long". "Too long" is a time period defined by the publisher
in the subscription meta-data sent to the subscriber at
subscription time. When that time period elapses with no
notifications, the subscriber polls the publisher for any
changes and resets its timer. Whenever a notification arrives,
the time period is reset, so that a poll occurs only after N
minutes have passed with no word from the publisher. As long
as the publisher's notifications arrive at regular intervals driven by
the content, polling will almost never occur. Polling will
occur only in the unlikely event that a packet was dropped
between the publisher and the subscriber, or in the case
where the subscriber's machine was disconnected from the
network for a sufficiently long period of time.
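
The "too long" timer can be sketched as a deadline that every notification pushes back. The class and method names are hypothetical, and the clock value is passed in explicitly so the behavior is easy to follow; a real subscriber would presumably read a system clock.

```python
class BackupPollTimer:
    """Poll only when no notification has arrived for a full interval."""

    def __init__(self, interval_seconds, now):
        self.interval = interval_seconds
        self.deadline = now + interval_seconds

    def notification_received(self, now):
        # Any notification resets the time period.
        self.deadline = now + self.interval

    def poll_due(self, now):
        """Return True (and reset the timer) if the subscriber should poll now."""
        if now >= self.deadline:
            self.deadline = now + self.interval
            return True
        return False
```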

The ping protocol works as follows: A single UDP packet
is sent to each subscriber. In multicast configurations, the
packet is sent to the subscription's multicast group. The
packet contains the following information:
Publisher host name
Subscription identifier
Sequence number
URL to request, if it changes constantly and cannot
therefore be part of the subscription meta-data. URL is
parameterised by subscriber identifier.
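
A ping carrying these fields might be built and sent as in the sketch below. The wire format is not specified in this description, so the JSON encoding, the field names, and the function names are all assumptions chosen for readability.

```python
import json
import socket


def encode_ping(publisher_host, subscription_id, sequence, url=None):
    """Build the datagram payload; the URL is included only when supplied."""
    fields = {
        "publisher": publisher_host,
        "subscription": subscription_id,
        "seq": sequence,
    }
    if url is not None:
        fields["url"] = url
    return json.dumps(fields).encode("utf-8")


def decode_ping(datagram):
    """Parse a payload produced by encode_ping back into a dictionary."""
    return json.loads(datagram.decode("utf-8"))


def send_ping(datagram, address):
    """Send one UDP packet to a (host, port) address; delivery is not guaranteed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram, address)
```

In a multicast configuration the same payload would be sent once to the subscription's group address instead of once per subscriber.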

3.6.2.5 Unreliable Multicast Protocol

Delivery of actual notification data in multicast-enabled
environments incurs bandwidth savings and performance
gain on servers that do not need to waste time sending
multiple copies of the data. LANs are generally reliable and
the likelihood that a multicast notification will be received in
its entirety is high. Backup request and polling mechanisms
may thus rarely be required.

According to one embodiment, the protocol must provide
error detection, so that subscribers know if they missed a
packet and can request the data of the server directly. A
simple packet sequencing scheme works just fine, and a
higher-level notification sequencing scheme tells subscribers
when they have missed a complete series of packets, or
the last packet of the previous notification.

The notification data is broken up into UDP packets with
the following header information:
Publisher host name
Subscription identifier
Data checksum
Data length
Return-receipt URL, parameterised for subscriber identifier
Notification sequence number within subscription
Packet sequence number within notification
last-packet indicator

The notification data is broken up into packets by the
protocol, and each packet is then multicast to all interested
parties. The last packet in the message is tagged with an
indicator so that recipients know when the message has been
received. The protocol ensures that as few packets as possible
are dropped. The protocol can easily combine packets
into groups and wait a fixed small amount of time between
transmissions of the next group. The publisher may use
S-MIME to sign and optionally encrypt the notification
content.

3.6.2.6 Synchronous Request

Synchronous requests are an integral part of the overall
system's reliable-delivery semantics, because the notification
protocols are unreliable and may not even carry any
data. Any time synchronous requests are used, the publisher
is in danger of overloading. To minimize the risk, all
synchronous requests are preceded by a random wait interval.
Whenever a ping notification is received, each recipient
waits a random amount of time before requesting the notification
data. Similarly, whenever a broken or missing
multicast notification is detected, the detecting recipient
waits before requesting the data directly.

Random waiting has two direct benefits. First, if there are
caching servers between the publisher and the subscriber,
random waiting increases the likelihood that only one recipient
will request the content of the publisher, with the other
recipients getting a cached copy. Second, even if there are no
caching servers in the loop, random waiting distributes the
load at the publisher. Most publishers are set up to deal with
high average request volumes, the notification process
already eliminates serious polling, multicast notifications
will almost always be reliable, and the result is that load at
the server should be manageable.

3.6.2.7 Return-Receipt

"Return-receipt" means that the publisher needs confir-

Return-receipt requires that the publisher maintain a list
of its subscribers and mark each subscriber as having
received the notification. A database is a logical choice for
the list, since return-receipt subscriptions may well be
highly personalized, or require payment, in which case a
database with other subscriber information may already be
in place. Database entries are created at subscription time
and removed at unsubscription time.

There are also caching issues with [illegible] classes of
return-receipt subscriptions. If the information is widely shared in
a unicast environment, it still cannot take advantage of
caching, since cached copies would by definition not be
requested of the publisher, which would lose any return-receipt
information. Instead, any URLs which identify
return-receipt content must be parameterised by subscriber
identifier so that the publisher can determine the subscriber
who received (multicast) or is receiving (unicast with synchronous
request) the content. The HTTP operation must
also be marked by the requesting subscriber as "no-cache",
i.e. do not serve a cached copy. Finally, in order to keep the
caching server from caching many copies of the data, in one
embodiment, HTTP 1.1 cache control operations can be used
by the publisher to prevent content being cached. According
to another embodiment, in the HTTP 1.0 environment, an
expiration date in the past serves the same purpose.

Return-receipt operations differ in multicast and unicast
notification environments. In the unicast world, the
return-receipt is [illegible] and occurs at the same time the content is
requested. All that needs to be done is for the URL and the
request to circumvent caching as described above. In the
multicast world, the subscriber already has the data, so the
return-receipt becomes a simple request with no data, where
again the URL and request are formatted as described above.
The subscriber must perform a random wait just as it would
in the unicast world, to avoid [illegible] the publisher with
requests.

3.6.2.8 Backup Polling Algorithm

If the current polling mechanism proceeds on its own, it
may request information when that information is already up
to date, causing spurious requests of the server and lowering
performance. Instead, according to one embodiment of the
present invention the polling mechanism advances its timer
by the polling interval any time one of the following events
occurs:

The subscriber receives an unreliable ping and follows
with a request of the publisher

The subscriber receives a broken multicast notification
and follows with a request of the publisher

The subscriber receives a valid multicast notification

By advancing its timer, polling becomes a true backup
mechanism, used only if no notifications arrive, or if a
multicast transmission breaks. The publisher controls the

mauon from each subscriber that a notification was received. PolUng interval via me subscription definition. The interval 

There are obvious performance impacts, because the scheme should be matched j° me formation's update frequency or 

requires both a subscriber list and direct communication to lts depending on how much the publisher 

between each subscriber and the publisher. The impact is trusts the notification mechanisms, 

most severe in an unreliable multicast environment. The 60 3.6.2.9 Notification Proxy Server 

system goes from one where a single copy is delivered to A notification proxy server implements the notification 

many unknown recipients with a high likelihood of mechanism and uses it on behalf of a subscriber community, 

reliability, to one where that process is followed up by N The proxy server stores incoming notifications, and sub- 

acknowledgments sent back to a publishing server which scribers poll the proxy periodically for any new notifications 

must now also maintain a list of all subscribers. The addi- 65 using HTTP. The proxy also stores authentication data so 

tional overhead is least felt in unicast environments, where that only registered subscribers can use the proxy, and each 

each subscriber is already in contact with the publisher. subscriber has access only to its own notifications. 
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Candidates for proxy use are: 

Laptops which are frequently disconnected from the network

Dialup users

DHCP users operating in a unicast environment

An intranet of subscribers whose network administrator
does not want notifications crossing a firewall and also does
not want the firewall and network overhead of all the clients
performing timed polls through the firewall.

Laptops and dialup users are candidates because they are
off the network often, and are therefore likely to miss
notifications. That forces them to use timed poll more often
than other subscribers, which may present an unacceptable
load on the server or the network. Polling a local proxy may
be more efficient, since a number of proxies can distribute
the polling load.

DHCP users are candidates because publishers cannot
effectively maintain subscriber lists when the addresses keep
changing, and DHCP addresses change constantly. The
publisher can try to use host names for address transparency,
but desktop clients frequently do not have host names
because they do not provide services. Note that in a multicast
environment, DHCP hosts work fine, because they can
join groups anonymously. The proxy is only useful in a
unicast environment.

3.6.2.10 Subscriber List Management 

Subscriber list management is only an issue in the fol- 
lowing situations: 

return-receipt subscriptions

unicast notification

List management operations are performed during subscribe
and unsubscribe operations. The subscribe and unsubscribe
URLs reference programs or functionality built into
the web server that manage a database. In the return-receipt
case, that database probably already exists, for payment or
personalization management. The list contents can be any
unique identifier, since it will be given to the subscriber at
subscription time, and the subscriber will place that identifier
in its request URL at content request or receipt confirmation
time.

Lists maintained for unicast notification delivery are IP
addresses or host names, since a UDP ping protocol packet
is sent to each member of the list. IP addresses are easier
to deal with than host names. DHCP clients use either a timed poll
or a proxy with a stable address.

3.6.2.11 Notification Filtering 

The notification system guarantees that users will see no
further notifications from a publisher once they remove that
subscription. Whenever a notification arrives, its subscription
identifier is checked against the list of subscriptions
currently active. Notifications for any subscription not on the
list are ignored. In addition, the driver handling the notification can
generate further unsubscribe requests and send them back to
the publisher, in case the original removal request was lost.

There are lower-level driver-specific filtering mechanisms
as well; the filtering described above is a final backstop that
is guaranteed to keep unwanted notifications from reaching
the user. For example, in a multicast service that does not use
return-receipt functionality, the driver can unsubscribe by
simply leaving the subscription's multicast group. The publisher
is never given any sort of client network address, so
it has no means of reaching the client once the client
unsubscribes.
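
The final-backstop filter described in this section can be sketched as follows. This is only an illustration under stated assumptions: the class name, the method names, and the idea of queuing repeat unsubscribe requests in a list are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationFilter:
    """Hypothetical backstop filter: drops notifications for inactive subscriptions."""

    active: set = field(default_factory=set)          # identifiers of active subscriptions
    resend_queue: list = field(default_factory=list)  # unsubscribe requests to send again

    def subscribe(self, subscription_id):
        self.active.add(subscription_id)

    def unsubscribe(self, subscription_id):
        self.active.discard(subscription_id)

    def accept(self, subscription_id):
        """Return True if a notification for this subscription may reach the user."""
        if subscription_id in self.active:
            return True
        # The original removal request may have been lost, so queue another one.
        self.resend_queue.append(subscription_id)
        return False


f = NotificationFilter()
f.subscribe("news-42")
f.subscribe("sports-7")
f.unsubscribe("sports-7")
results = [f.accept("news-42"), f.accept("sports-7")]
```

Here the notification for the removed subscription is dropped and a repeat unsubscribe request is queued for the publisher.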

3.6.3 Subscription Configuration 

This section describes the process by which the notification
system creates subscriptions. From the user's point of
view, the act of subscribing to content is a very simple
one-step operation. In particular, the user does not need to
get involved in selecting notification mechanisms or configuring
any subscription properties. All configuration is
performed by negotiation between the client and the
publisher, subject to any rules imposed by the client's local
network environment.

When the client starts up, it configures itself according to 
optional meta-data that it fetches via a special configuration 
subscription built into the caching server (see section 3.3 
Client Configuration). Part of that meta-data consists of a set 
of configuration HTML tags, one per notification driver. 
Each tag is identified by the driver's name and contains a set 
of driver-specific attributes used to configure the driver for 
use in the client's local network environment. Local network 
administrators can choose, for example, to disable certain 
drivers, guaranteeing that they will never be used to carry 
notifications. For example, the ICMCAST driver is disabled 
when the administrator puts the following configuration 
meta-data in their configuration page: 

<ICMCAST DISABLE=YES> 

Similarly, administrators can use the ICNOTIFYCONFIG
meta-data tag to choose the negotiation order in which
different services will be sent to the publisher, and can bind
these negotiation orders to host name regular expressions:

<ICNOTIFYCONFIG DRIVER_LIST='ICMCAST, ICDOORBELL'
HOST_REGEXP='.*\.incommon\.com'>

The ICNOTIFYCONFIG tag allows administrators to
force clients to use one order when communicating with a
particular host or domain, and another order for another host
or group of hosts.

Once the client configures its notification services accord- 
ing to the wishes of the local network administrator, sub- 
scription becomes a matter of negotiating a driver with the 
publisher, and configuring the subscription according to its 
definition. When the client subscribes, it sends an HTTP 
request to the publisher. The request contains a list of desired 
notification drivers in preference order: 

X-inCommon-Driver-List: ICMCast, ICDoorbell

The request also contains a configuration name/value pair 
for each driver, describing the driver such that the publisher 
will be able to send notifications to it. Each driver has its 
own configuration data, and not all drivers need this con- 
figuration information. The unreliable ping service, for 
example, needs to supply a TCP port number for the 
publisher's notifications: 

X-inCommon-Driver-ICDoorbell: <ICDOOR LISTEN_PORT=2287>

The publisher picks the first driver on the driver list that 
matches a driver it is capable of using to send notifications. 
It then performs any registration that it needs to (entering 
subscriber information in a database, for example), and 
returns the subscription as an HTML page. The page con- 
tains all the information required for the client to receive 
notifications, including: 

backup polling interval

custom scheduling

use of special services such as return-receipt

notification driver to use and its driver-specific configuration
data

At this point the subscription process is complete, and
notifications can begin flowing from the publisher to the
client.
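
The driver negotiation described above (the publisher takes the first driver on the client's preference list that it can use) might look like the sketch below. The function names are hypothetical, "ICPoll" is an invented driver name used only for the example, and case-insensitive name matching is an assumption made because the text spells driver names both as ICMCAST and ICMCast.

```python
def parse_driver_list(header_value):
    """Split an X-inCommon-Driver-List value into driver names, in preference order."""
    return [name.strip() for name in header_value.split(",") if name.strip()]


def pick_driver(client_preferences, publisher_supported):
    """Return the first client-preferred driver the publisher supports, or None."""
    supported = {name.lower() for name in publisher_supported}
    for name in client_preferences:
        if name.lower() in supported:  # assumed: names compare case-insensitively
            return name
    return None


prefs = parse_driver_list("ICMCast, ICDoorbell")
# Hypothetical publisher that cannot multicast but supports the doorbell driver.
chosen = pick_driver(prefs, ["ICDoorbell", "ICPoll"])
no_match = pick_driver(prefs, [])
```

With these inputs the publisher skips ICMCast and settles on ICDoorbell; an empty result would mean no common driver exists.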

The notifications that arrive at the caching server can 
themselves modify their subscription's meta-data, changing 
its polling interval for example, or switching notification 
services from polling to multicast.

3.7 Generalized Reporting 

The back-end server is also able to control caching server 
reporting via meta-data. Reporting meta-data is stored in a 
site meta-data page, just like other meta-data, such as 
ICLOOKAHEAD, ICEXPIRES, and TOC pointers. Reports 
are defined using the ICREPORT HTML tag. Each ICRE- 
PORT defines an internet domain, a set of filtering regular 
expressions, a report type, and an upload schedule. 

The caching server periodically scans its cache as controlled
by the upload schedule. Every piece of cache content
whose URL matches the ICREPORT's domain is tested
against the ICREPORT's filter regular expressions. The
filtering expressions consist of a "match" filter and an
optional "no-match" filter. Each URL must both match the
"match" filter and not match the "no-match" filter if it is
present.
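
The domain test and the two-filter test can be illustrated with a short sketch. The function and parameter names are hypothetical, the example URLs are invented, and two details are assumptions rather than statements of the specification: a URL is taken to belong to a domain when its host equals the domain or ends with it, and the regular expressions are applied with a search rather than a full match.

```python
import re
from urllib.parse import urlsplit


def url_in_report(url, domain_name, match_regexp, no_match_regexp=None):
    """Return True if a cached URL should be included in the report."""
    host = urlsplit(url).hostname or ""
    # Assumed domain rule: exact host or any sub-host of the report's domain.
    if not (host == domain_name or host.endswith("." + domain_name)):
        return False
    # The URL must match the "match" filter...
    if re.search(match_regexp, url) is None:
        return False
    # ...and must not match the optional "no-match" filter.
    if no_match_regexp is not None and re.search(no_match_regexp, url):
        return False
    return True


cached = [
    "http://www.example.com/news/a.html",
    "http://www.example.com/news/a.gif",
    "http://other.org/news/b.html",
]
selected = [u for u in cached if url_in_report(u, "example.com", r"/news/", r"\.gif$")]
```

Only the HTML page on the reporting domain survives: the image is excluded by the "no-match" filter and the third URL fails the domain test.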

Each piece of cache content that passes the filter then has
information extracted from it that is appropriate to the
ICREPORT's report type. Typical reports include hit counts,
performance statistics (time required to fetch the content), or
context in which the content was fetched (subscription,
browser). Other reports can be created as needed, and
identified by a new report type.

The reporting mechanism allows any content provider to
get one or more reports on any subset of their content as
stored in all client caches that access content owned by the
publisher. The user need not subscribe to this content; they
just need to visit the site, whereupon the site meta-data is
retrieved by the caching server and the report upload is
configured. Because the filtering mechanism uses regular
expressions, the publisher can create several ICREPORT
meta-data tags, each defining reports for a different subset of
their content.

The ICREPORT tag has the following attributes:

DOMAIN_NAME: the internet domain of which all
matching content in the report must be a member. This
domain name must also match the domain name of the site
meta-data page's URL, thus preventing malicious content
providers from getting report information on content that
they do not own.

MATCH_REGEXP: the regular expression which cache
content must match in order to be reported on.

NO_MATCH_REGEXP: the regular expression which
cache content must not match in order to be reported on.
This attribute is optional.

REPORT_TO_INTERVAL: the interval in seconds
between report uploads.

REPORT_TO_URL: the URL to which the report is
delivered, via an HTTP POST.

REPORT_TYPE: the type of report desired. If this
attribute is missing, the report type defaults to a hit count
report.

4. Technology Local to Caching Server

The following sections describe technology local to the
caching server. Algorithms in this section provide intelligent
cache management and use of network resources without the
need for input from a back-end server. Accordingly, these
algorithms cannot provide the fine tuning that is possible
with interaction from a back-end server, but do provide some
acceleration on any web site, even if that site does not have
a back-end server.

4.1 Automatic Expiration Control
4.1.1 Overview

Whenever the server is asked to retrieve content from the
web, the server places the content in local storage while
returning the content to the requestor (either the browser, a
subscription, or the server itself). The server then satisfies
subsequent requests for the same content from the local
storage rather than going to the network. This strategy
improves performance but is achieved at a cost. If the
content changes at its origin, the caching server will deliver
an old copy of the data from local storage, rather than the
new copy from the net.

The caching server solves this problem by assigning each
piece of content an expiration date. The server satisfies
requests for cached content from local storage until the
expiration date is reached, after which time it checks at the
origin site to see if the content has changed. The origin site
may tell the server what the expiration date is, based on
knowledge of the content (see Section 3.3, Custom Expiration
Control). There may be sites, however, that do not know
their content's expiration behavior. In these cases the caching
server is forced to invent an expiration date. The quality
of the algorithm used to invent the expiration date is
extremely important. If it is too liberal, the caching server
will serve out-dated content from the cache. If it is too
conservative, the server must access the Web frequently,
thus reducing performance.

4.1.2 Algorithm Pseudo-Code

Following is a pseudo-code description of the expiration
date computation algorithm according to one embodiment,
followed by detailed explanations of the algorithm components.

    if(documentChanged && accessedMoreRecentlyThanModified) then
        AddNewLifetimeSample
    endif

    if(document has no modification data) then
        expiration = now + fixed amount
    else if(document has not changed) then
        if(document has lifetime samples) then
            expiration = now + one sample variance
        else
            expiration = now + ((now - last modification date) / 2)
        endif
    else if(document has changed) then
        if(document has lifetime samples) then
            expiration = last modification date + mean lifetime -
                         one sample variance
            if(expiration is in the past) then
                expiration = now + one sample variance
            endif
        else
            expiration = now + ((now - last modification date) / 2)
        endif
    endif

4.1.3 Lifetime Samples

The algorithm used by the server attempts to "learn" the
expiration behavior of each piece of content by tracking its
modification history. Every time the content is accessed
from the network, its last-modification date is recorded. If
that last-modification date changes and the content is accessed
subsequently to that change, then the object has not changed
between the current last-modification date and the access time. That
time interval can thus be treated as a sample of how often the
object changes, i.e. its lifetime.

Each sample is plugged into an estimator algorithm that
tracks two quantities: the mean lifetime estimate M and the
variance in lifetime samples V. As each new sample is
added, the mean and variance change. The variance is
weighted more heavily toward recent behavior. The estimator
algorithm used is identical to that used in the TCP/IP
transport protocol for network round-trip estimation, and is
also known as the "Jacobson-Karels estimator". The known
estimator algorithm is utilized in a novel manner as
described below.
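
A minimal sketch of such an estimator follows. It assumes the smoothing gains of 1/8 and 1/4 commonly used for TCP round-trip estimation (the text does not state the gains) and assumes that the first sample initializes M to the sample and V to half of it. In TCP the second quantity is a smoothed mean deviation; it plays the role of the variance V here. The class and method names are hypothetical.

```python
class LifetimeEstimator:
    """Tracks a smoothed mean lifetime M and a smoothed deviation V, TCP-style."""

    ALPHA = 1 / 8  # gain applied to the mean (assumed, as in TCP RTT estimation)
    BETA = 1 / 4   # gain applied to the deviation term (assumed, as in TCP)

    def __init__(self):
        self.mean = None      # M: mean lifetime estimate, in seconds
        self.variance = None  # V: smoothed deviation of the samples, in seconds

    def add_sample(self, lifetime):
        """Fold one lifetime sample (seconds) into the running estimates."""
        if self.mean is None:
            # First sample: assumed initialization borrowed from TCP practice.
            self.mean = float(lifetime)
            self.variance = lifetime / 2
            return
        deviation = abs(self.mean - lifetime)
        # Update V before M so the deviation is measured against the old mean.
        self.variance = (1 - self.BETA) * self.variance + self.BETA * deviation
        self.mean = (1 - self.ALPHA) * self.mean + self.ALPHA * lifetime


est = LifetimeEstimator()
for sample in (3600, 3600, 7200):  # two one-hour lifetimes, then a two-hour one
    est.add_sample(sample)
```

Because each update multiplies the old values by a factor below one, older samples fade geometrically, which is what weights the estimates toward recent behavior.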

4.1.4 Case 1: lifetime samples exist 

As long as the server owning the content supplies last- 
modification dates with its content, this scheme works fairly 
well. The server accumulates a history of samples for each 
piece of content and stores them permanently. Whenever an 
object expires, i.e. it is requested and the caching server sees
that the content's expiration date is in the past, the caching
server goes to the network and asks the content owner for a
new copy of the object. The caching server then uses its
samples in one of two ways to create a new expiration for the
content. Which method it uses depends on whether the
content has actually changed or not.

If the content has not changed, then the previous expira- 
tion date was too conservative, i.e. too short. Another sample 
cannot be accumulated because the object has not changed. 
Instead, the server finds itself in a "grey zone" where its data 
indicates the object should expire, but the owning server 
indicates that the object has not expired. In this situation, the 
server simply adds a single variance V from the content's 
estimator to the current time and uses that as the new 
expiration date. This value allows the server to provide some 
performance benefit (by continuing to cache the object). The 
variance makes a good "fudge factor" (so called because the 
server is operating with insufficient data and must guess at 
a reasonable new expiration date) because it measures the 
difference between the accumulated samples and their mean. 
The variance therefore provides at least some approximation 
to a valid sample, even if it is statistically less likely than the 
mean. 

If the content has indeed changed, then the server adds to 
the content's most recent last-modified date the current 
value M from the estimator algorithm. In order to lessen the 
likelihood that the server will expire the content too late, it 
then subtracts from the result the current variance value V. 
The single variance V is chosen for the same reasons as in 
the previous paragraph. If V is sufficiently large that the 
last-modified date plus M minus V is before the current time, 
i.e. the object will already have expired, then technically the 
object is in a "grey zone" where it could expire at any 
moment. Given that the caching server wants to provide 
some performance benefit even under these circumstances, 
it creates an expiration date that is as small as possible while 
still allowing the object to be cached. Again, the server 
chooses a single variance V, for the same reasons as 
described above. 

4.1.5 Case 2: No sample data 

If the caching server has no samples, it still attempts to 
provide a rational expiration date, but it must do so with less 
data to go on. Again, there are two algorithms used: one if 
the content actually changed, and one if the content did not 
change and the previous expiration date was too conserva- 
tive. 

If the content has not changed, the caching server con- 
structs a new expiration date which is the current time plus 
the difference between the current time and the time the 
content was last modified. Although this value is not as good 
as one derived from watching the object's modification 
history, it works reasonably well. The object was not modified
in the interval between its last-modification date and the
current time. It is thus predicted to not be modified for the
same amount of time into the future, although with little
certainty. That lack of certainty is reflected in taking half of
the interval rather than a full interval. Again, the idea is for
the algorithm to balance accuracy (always handing back the
most recent content) with performance (always satisfying
requests from the cache).

If the content has changed, the frequency with which the
object changes (one interval between now and the time last
modified) is likely to be an inaccurate estimate. The object
is ideally served out of the cache to maintain performance,
but must also be accurate. The algorithm in both situations
works in exactly the same manner.

4.1.6 Case 3: No data at all 

If the originating server is unfriendly, and provides no
modification data, there is even less data for the caching
server to go on. It must thus make an expiration estimate that
is essentially a wild guess. What it does in this situation is
add a configurable amount of time to the current date. The
time is based on site meta-data if provided, or a constant of
the implementor's choosing.
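
The three cases above can be combined into one function, sketched here as a direct transcription of the pseudo-code in section 4.1.2. The function name, the parameter names, and the one-hour default for the fixed amount are hypothetical; times are plain numbers of seconds, and M and V are passed in as already-computed estimates.

```python
def compute_expiration(now, last_modified, changed, mean=None, variance=None,
                       fixed_amount=3600.0):
    """Return a new expiration time (seconds) for one piece of cached content."""
    # Case 3: the origin supplies no modification data at all.
    if last_modified is None:
        return now + fixed_amount  # assumed default; configurable in the text

    # Case 2: no lifetime samples yet, whether or not the content changed.
    if mean is None or variance is None:
        return now + (now - last_modified) / 2

    # Case 1, content unchanged: the previous expiration was too short.
    if not changed:
        return now + variance

    # Case 1, content changed: predict the next change, minus a safety margin.
    expiration = last_modified + mean - variance
    if expiration < now:
        # "Grey zone": already expired on paper, so cache for one variance.
        expiration = now + variance
    return expiration


no_data = compute_expiration(1000.0, None, changed=False)
no_samples = compute_expiration(1000.0, 600.0, changed=False)
unchanged = compute_expiration(1000.0, 600.0, changed=False, mean=500.0, variance=100.0)
changed_ok = compute_expiration(1000.0, 900.0, changed=True, mean=500.0, variance=100.0)
grey_zone = compute_expiration(1000.0, 600.0, changed=True, mean=300.0, variance=100.0)
```

The last call shows the grey-zone fallback: 600 + 300 - 100 lies before the current time of 1000, so the expiration is set to the current time plus one variance.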

4.2 Cache Compaction

4.2.1 Overview

As the server's cache fills with content, eventually the
cache becomes large enough that its use of resources begins
to affect the client machine's performance. To forestall this
situation, the caching server automatically compacts its
cache periodically. The cache size ceiling starts at a standard value
(measured in megabytes of content), which can be changed
by the end user. Whenever the cache size exceeds its ceiling,
compaction occurs.

The most important part of the compaction process is
deciding which pieces of content to remove from the cache
and which to keep. If the algorithm is not discriminating
enough, then content which the user accesses frequently is
just as likely to be removed as content which the user has
rarely seen. Lookahead complicates the situation, since by
definition it is predictive and if it fails to predict correctly,
content will be cached that the user never accesses. Size of
the content also complicates things, since large content is
more expensive to fetch than small content.

The compaction algorithm described below takes into
account a number of factors which together decide
which content to keep and which to throw out.

4.2.2 Compaction Algorithm 

Compaction is an expensive operation relative to other
operations performed by the caching server. In order to keep
compaction from occurring too often, the algorithm does not
simply remove content until the cache size returns to its
allowable maximum. Instead, the algorithm compacts the
cache to 75% of its former size, allowing some head room
for the cache to grow before compaction again occurs. 75%
is a reasonable compromise between frequency of compaction
(if the percentage were higher) and lack of desired
content and corresponding lower performance (lower
percentages).

According to one embodiment, the compaction algorithm
measures the following properties of each piece of content:

when it was last accessed

how much network resource is required to retrieve the
content

how often it is accessed

These three properties are normalized into a score using
the algorithm described below. All content is then ordered by
score, at which point compaction becomes a simple process
of removing the worst-scoring content until the overall
cache size drops below the ceiling. The algorithm places
paramount importance on the time a piece of content was
last accessed. If the content has never been accessed, then
the importance is given to the time it arrived in the cache.
Content that is old needs to be removed quickly. Second in
importance is frequency of access. Even old content should
remain in the cache if it is accessed often enough. Size is
least important, unless the content is truly huge, in which
case the algorithm can keep significantly more pieces of
content in the cache by deleting a single large object, and
should probably do so. The scoring algorithm normalizes
these three components (size, last-access, and frequency of
access) into a single value, where high scores indicate a
more suitable candidate for removal, and low scores a more
suitable candidate for retention.
4.2.2.1 Last-Access Component

The last-access part of the score is highly non-linear and is
approximated with a piece-wise linear function. Content
is assumed to be most useful in the first eight hours after
it is accessed. It then ratchets down in usefulness if between
eight hours and four days have elapsed since the content was
last accessed. According to one embodiment, once a
piece of content has not been accessed in more than four
days, it is deemed useless unless it was accessed often prior to that
most recent access. If a piece of content is never accessed,
the algorithm assigns an initial last access time equal to the
time the content arrived in the cache. That is, all pieces of
content are "accessed" at least once, with that time equal to
the arrival time.

The algorithm assigns one point to the last-access com- 
ponent for each minute since the content has been accessed, 
up to a maximum of eight hours worth, or 480 points: 

480=8*60 

For each minute over eight hours but below four days 
since the content has been accessed, the algorithm assigns 
half a point to the last-access component, up to a maximum 
of 3120 points: 

3120=480+(((96 hours-8 hours)* 60 minutes)/2) 

Finally, for each minute over four days since the content
has been accessed, the algorithm assigns four points to the
last-access component. The dramatic increase in the slope of
the function represents our belief that content not accessed
in the past four days is very unlikely to be used again
unless it has been accessed often before that. The increased
slope is designed to be large enough that content accessed
once or twice will get a high score and be removed, but small
enough so that content accessed 10-20 times will reduce the
score to the point that the object might not be removed.
There is no maximum on the number of points assigned in
this manner. For example, at approximately a month after
most recent access, the score is:

480 +
(((96 hours - 8 hours) * 60 minutes) / 2) +
(28 days - 4 days) * 24 hours * 60 minutes * 4 =
141360
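
The piece-wise linear last-access term can be written compactly as below. The slopes and breakpoints come from the text; the function and constant names are hypothetical, and the input is assumed to be the number of minutes since the last access (or since arrival, for content never accessed).

```python
EIGHT_HOURS = 8 * 60       # first breakpoint, in minutes
FOUR_DAYS = 4 * 24 * 60    # second breakpoint, in minutes


def last_access_score(minutes_since_access):
    """Return the last-access term L for content idle for the given minutes."""
    m = max(0.0, minutes_since_access)
    # One point per minute for the first eight hours.
    score = min(m, EIGHT_HOURS) * 1.0
    # Half a point per minute between eight hours and four days.
    if m > EIGHT_HOURS:
        score += (min(m, FOUR_DAYS) - EIGHT_HOURS) * 0.5
    # Four points per minute beyond four days, with no upper limit.
    if m > FOUR_DAYS:
        score += (m - FOUR_DAYS) * 4.0
    return score


at_eight_hours = last_access_score(8 * 60)
at_four_days = last_access_score(4 * 24 * 60)
at_one_week = last_access_score(7 * 24 * 60)
at_four_weeks = last_access_score(28 * 24 * 60)
```

The four sample values reproduce the figures used in the text: 480 at eight hours, 3120 at four days, 20400 at one week, and 141360 at 28 days.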



4.2.2.2 Size Component

The algorithm is non-linearly
biased against "large" content. The algorithm divides all
content up into three zones based on size. The zone boundaries
are based on the distribution of content sizes in the
internet. Zone 1 content is typically HTML, and must be less
than 16 kilobytes in size. Zone 2 content is typically inline
images and ranges in size from 16 to 40 kilobytes. Zone 3
content is typically large images, audio files, and full-motion
video, and is larger than 40 kilobytes.

To approximate the desired non-linear behavior, the algorithm
starts with a "score" equal to the content's size. If the
content is in zone 2 or 3, the score is increased by the amount
that the content exceeds 16 KB. If the content is in zone 3,
the score is further increased by the amount that the content
exceeds 40 KB. The algorithm considers high scores as more
likely for compaction than low scores. As an example, a
piece of content 12 KB in size gets a score of 12. A piece of
content that is 31 KB in size gets a score of 31+15=46.
Finally, a piece of content that is 122 KB in size gets a score
of 122+106+82=310. The reason the algorithm gives a
disproportionately large score to large content is that it can
keep far more content in the cache by deleting a single large
piece of content than a number of smaller pieces of content.
By increasing the score related to size non-linearly, the
algorithm is far more likely to remove big content than
smaller content.

The formula for size-related scoring is thus:

S=size+max(0, size-16 K)+max(0, size-40 K)
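
As a quick check of the size formula, the sketch below evaluates it for the three worked examples. The function name is hypothetical and sizes are assumed to be given in kilobytes.

```python
def size_score(size_kb):
    """Return the size term S for content of the given size in kilobytes."""
    over_zone1 = max(0, size_kb - 16)  # amount by which the content exceeds 16 KB
    over_zone2 = max(0, size_kb - 40)  # amount by which the content exceeds 40 KB
    return size_kb + over_zone1 + over_zone2


zone1 = size_score(12)
zone2 = size_score(31)
zone3 = size_score(122)
```

The results are 12, 46, and 310, matching the examples given for zones 1, 2, and 3.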

4.2.2.3 Frequency Component 

The frequency component is also non-linear. The reason
for the non-linearity is that there is a large class of content
which is accessed either never or only once. Content that is
never accessed could, for example, have been looked ahead
on by the caching server but never actually seen by the user.
Other content never accessed could include headlines loaded
by the caching server on behalf of a channel, but never read
by the user.

Content accessed zero times or one time may be maintained
in the cache for an eight hour period and then
disposed of as quickly as possible. Content accessed more
than once, however, increases dramatically in importance,
because it probably is not a headline that was read once and
discarded, but rather a more useful piece of content.

The algorithm uses a step function to reflect this property
of incoming content. Content accessed zero times is
assigned a frequency component term of 1. Content
accessed once is given a value of 2. Content accessed more
than once is given a value of the number of accesses plus 3.
The function is therefore linear with a "jump" at two
accesses.
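
The step function for the frequency term is small enough to state directly. The function name is hypothetical; the values come from the paragraph above.

```python
def frequency_term(access_count):
    """Return the frequency term F for content accessed access_count times."""
    if access_count <= 0:
        return 1                 # never accessed
    if access_count == 1:
        return 2                 # accessed exactly once
    return access_count + 3      # the "jump": two accesses already give 5


terms = [frequency_term(n) for n in (0, 1, 2, 3, 50)]
```

The sequence 1, 2, 5, 6 shows the jump at two accesses, and 50 accesses give a term of 53.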

4.2.2.4 Complete Algorithm Formula 

According to one embodiment, the complete scoring
algorithm formula for the compaction score C is:

C=(S+L)/F

Where S is the size term, L is the last-access term, and F
is the frequency term.
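
A sketch of how the score might drive compaction follows. The scoring function is the formula above; the victim-selection loop is only one possible reading of the 75% target from section 4.2.2. The function names, the tuple layout of the cache entries, and the example entries are hypothetical.

```python
def compaction_score(size_term, last_access_term, frequency_term):
    """Return C = (S + L) / F; higher scores mean better removal candidates."""
    return (size_term + last_access_term) / frequency_term


def select_for_removal(entries, target_fraction=0.75):
    """Pick entries to delete until the cache shrinks to the target fraction.

    entries is a list of (name, size, score) tuples; returns names to remove.
    """
    total = sum(size for _, size, _ in entries)
    target = total * target_fraction
    victims = []
    remaining = total
    # Visit the worst-scoring (highest C) content first.
    for name, size, score in sorted(entries, key=lambda e: e[2], reverse=True):
        if remaining <= target:
            break
        victims.append(name)
        remaining -= size
    return victims


# 40 KB content idle for a week but accessed 50 times before that.
old_but_popular = compaction_score(64, 20400, 53)
# The same content accessed once, eight hours ago.
recent_once = compaction_score(64, 480, 2)

cache = [
    ("logo.gif", 40, old_but_popular),
    ("headline.html", 40, recent_once),
    ("stale.mov", 20, 5000.0),  # invented entry with a very high score
]
victims = select_for_removal(cache)
```

In this toy cache of 100 KB, removing the highest-scoring entry leaves 80 KB, still above the 75 KB target, so the next-highest entry is removed as well.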

High scores describe content that should be removed, and
low scores describe content that should be kept. A high score
therefore arises from a combination of a low number
of accesses, a last-access time far in the past, and a large size.
The algorithm has the following desirable properties:

The value of L is large compared to that of S as time
increases; that means the more time that has elapsed since
the content has been accessed, the less important the content's
size is. S is most able to affect the score in the first
eight hours since content has been accessed, where the value
of L is fairly small.

The larger F, the smaller the score. Most content deserving
of removal is accessed once (F=2) or not at all (F=1),
making the frequency component essentially irrelevant and
lending all importance to S and L.

As content is accessed more than once, the value of F
jumps from 2 to 5 and then increases linearly with each
access. That means even if a 40 K piece of content has gone
a week without being accessed, if it was accessed 50 times
before that, its score:

(64+20400)/53=386

is approximately the same as that of the same piece of
content accessed only once in the past eight hours:

(64+480)/2=272

This property is particularly useful for boilerplate
graphics that are not accessed for a while. Such graphics can
be expensive to download, and the algorithm should try to
keep them in the cache if they are shared by many web pages.

4.3 Bandwidth Management

The caching server is responsible for making sure the
user's browsing experience is as good as possible. While
lookahead improves the user experience in the long run, in
the short run lookahead can detract from the user's
experience if background lookahead processing is taking up
network bandwidth while the user is performing foreground
browsing operations. The caching server has several
mechanisms to manage bandwidth and optimize use of the
network. FIG. 8 is a flow chart illustrating an overview of
one embodiment of bandwidth management.

4.3.2 Bandwidth sharing

According to one embodiment of the present invention,
every task performed by the caching server runs at a
particular priority. Request tasks are responsible for taking
a request made internally by the caching server or by the
browser and satisfying it, either from the cache or from the
network. If the request is satisfied through the network, it
must share bandwidth with other requests in a manner
consistent with its priority. Browser requests, for example,
must get more of the available bandwidth than lookahead
requests.

According to one embodiment, a standard sockets interface
is used in a special algorithm that implicitly manages
bandwidth according to request priority. Each TCP connection
managed by Windows Sockets has a "window" which
represents the amount of data on that connection that can be
in transit before transmission stops and awaits
acknowledgment from the destination endpoint. The larger the
window, the more data can be in transit. Incoming TCP data is
buffered for delivery to the application, which reads the
data, causing it to be acknowledged to the sender, which in
turn opens the transmission window for more data.

By not reading a TCP connection, an application can
prevent data from being acknowledged and therefore more
data from being transmitted. By selectively reading and not
reading different TCP connections, the caching server can
control the amount of network bandwidth taken by each
connection. The only problem with this solution is the lag
between the time the caching server stops reading a
connection and the time the sender's window is exhausted. A
large window can keep sufficiently large amounts of data "in
the pipe" that ceasing to read a connection does not
immediately lower that connection's share of the bandwidth.

To solve this problem, each TCP connection's transmission
window is adjusted according to request priority.
High-priority requests get the largest window recommended
(typically 8192 bytes), and low-priority requests get a small
window, typically one or two network Maximum Transmission
Units (MTU). When a low-priority request is running,
it will run more slowly, and with less data outstanding, than
a high-priority request. As soon as a high-priority request
enters the system, data from all lower-priority requests are
ignored. Since low-priority requests have small TCP
windows, the amount of data still in the pipe from those
connections is quite small (and completely controlled by
setting the low-priority request window size), and ceasing to
read the connections causes an immediate drop in the
bandwidth consumed by them, bandwidth which can then be
taken by high-priority connections.

As soon as there are no more high-priority requests, the
caching server once again begins reading from lower-priority
request connections, opening up their windows. The
server does not begin doing so immediately. Instead, it
assumes that one high-priority request will generate other
high-priority requests (an HTML page request by a
browser will lead to requests for the page's in-line images,
for example). Every time high-priority traffic travels through
the server, the server advances a timer. The duration of the
timer is a measure of how long the server is willing to wait
before it decides there are no immediately following
high-priority requests. Only when the timer expires does the
server begin reading from lower-priority connections.

There are a number of variations on this algorithm that the
server can employ as needed. As it measures the number and
frequency of high-priority connections, it can further
improve the performance of low-priority connections by
opening their windows. When the server believes enough
time has passed that the presence of a high-priority request
is increasingly likely, it can begin shrinking the size of
low-priority request windows, so that when a high-priority
request finally does arrive, low-priority requests have
small windows and therefore small amounts of data in the
pipe and minimal impact on available bandwidth.

Thus, a method and apparatus for storage and delivery of
documents on the Internet is disclosed. The specific
arrangements and methods described herein are merely
illustrative of the principles of the present invention.
Numerous modifications in form and detail may be made by
those of ordinary skill in the art without departing from the
scope of the invention. Although this invention has been
shown in relation to a particular preferred embodiment, it
should not be considered so limited. Rather, the present
invention is limited only by the scope of the appended claims.

We claim:

1. A method for managing bandwidth between a client
device and a network, said method comprising the steps of:

allocating a priority to each request from said client
device;

allocating bandwidth to a low-priority request to be
processed;

processing said low-priority request based on said
allocated bandwidth until a high-priority request demands
to be processed; and

resuming processing said low-priority request after a
predetermined amount of time following completion of
processing said high-priority request, said predetermined
amount of time being calculated to predict that
said high priority request did not generate other high
priority requests.

2. The method of claim 1 wherein the step of processing 
low-priority request comprises the steps of: 

ceasing to process all running low-priority requests upon 
receiving a high-priority request; 

allocating bandwidth to said high-priority request; and 

processing said high-priority request based on said
allocated bandwidth.

3. The method of claim 1 wherein the step of resuming
processing said low-priority request is performed if no
high-priority requests are running.

4. The method of claim 1 wherein a request for displaying 
data to a user is allocated a high priority. 

5. The method of claim 1 wherein a request for preloading 
predicted data from a network server to a cache on said 
client device is allocated a low priority. 
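
By way of illustration only, the bandwidth-sharing behaviour of section 4.3.2 (priority-sized TCP windows plus a hold-off timer before low-priority reads resume) might be sketched in Python as follows. This is a simplified model under stated assumptions, not the disclosed implementation: the class `LowPriorityGate`, its methods, the timer duration, and the 1500-byte MTU are all hypothetical, and time is passed in explicitly so the logic can be exercised without real sockets.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_WINDOW_BYTES = 8192                   # "largest window recommended" in the text
ASSUMED_MTU_BYTES = 1500                   # assumption: the text names no MTU value
LOW_WINDOW_BYTES = 2 * ASSUMED_MTU_BYTES   # the text says one or two MTUs


def window_for(high_priority: bool) -> int:
    """Window size, in bytes, to request for a connection of this priority."""
    return HIGH_WINDOW_BYTES if high_priority else LOW_WINDOW_BYTES


@dataclass
class LowPriorityGate:
    """Decides whether low-priority connections may be read at a given time."""
    hold_off_seconds: float                # hypothetical timer duration
    active_high: int = 0                   # high-priority requests in flight
    last_high_traffic: Optional[float] = None

    def high_started(self, now: float) -> None:
        self.active_high += 1
        self.last_high_traffic = now

    def high_traffic(self, now: float) -> None:
        # Any high-priority traffic pushes the timer forward.
        self.last_high_traffic = now

    def high_finished(self, now: float) -> None:
        self.active_high -= 1
        self.last_high_traffic = now

    def may_read_low(self, now: float) -> bool:
        if self.active_high > 0:
            return False                   # low-priority data is ignored
        if self.last_high_traffic is None:
            return True                    # no high-priority traffic seen yet
        # Resume only once the hold-off timer has expired.
        return now - self.last_high_traffic >= self.hold_off_seconds
```

A server loop would call `may_read_low` before reading any low-priority socket. Declining to read leaves the data unacknowledged, which is what throttles the sender in the scheme described above.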